fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0

DistributedDL missing idxs #3579

Open kevinbird15 opened 2 years ago

kevinbird15 commented 2 years ago

Please confirm you have the latest versions of fastai, fastcore, and nbdev prior to reporting a bug (delete one): YES

Describe the bug
I believe DistributedDL is currently not behaving properly when the dataset length does not divide evenly into batches across the distributed processes.

To Reproduce
Steps to reproduce the behavior:

  1. Run the tests in 20a_distributed
  2. Look at the output from the following test (quoted from the notebook; it assumes the notebook's usual imports, e.g. fastai.distributed, fastcore.test, and torch):
    #hide
    dl = TfmdDL(list(range(50)), bs=12, num_workers=2, drop_last=True)
    for i in range(4):
        dl1 = DistributedDL(dl, i, 4)
        test_eq(list(dl1), [torch.arange(i*13, i*13+12)%50])

Expected behavior
I would expect that a dataloader over items 0-49 with a batch size of 12, distributed across 4 processes, would be split as:
dl0 = items 0-11, dl1 = items 12-23, dl2 = items 24-35, dl3 = items 36-47
(items 48 and 49 are dropped because drop_last=True).

# current check, which fails with the new code update
for i in range(4):
    print([torch.arange(i*13, i*13+12)%50])

Current Output:

[tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])]
[tensor([13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])]
[tensor([26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37])]
[tensor([39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  0])]
# proposed check
for i in range(4):
    print([torch.arange(i*12, i*12+12)%50])

Proposed Output:

[tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])]
[tensor([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])]
[tensor([24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35])]
[tensor([36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47])]
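
For anyone following along, here is a minimal sketch of the arithmetic behind the two checks above (the variable names are illustrative only, not fastai internals): the current check effectively gives each rank a contiguous chunk starting at rank * ceil(50/4) = rank * 13, with drop_last trimming each chunk to a single full batch of 12 and the last rank wrapping around via the modulo, while the proposed check drops the incomplete batch first and splits the remaining 48 indices evenly.

import torch

n_items, bs, world_size = 50, 12, 4

# Current check: each rank starts at rank * ceil(n_items / world_size) = rank * 13;
# drop_last keeps only one full batch of 12 per rank, and the last rank wraps via % n_items.
per_rank_padded = -(-n_items // world_size)   # ceil division -> 13
current = [torch.arange(r * per_rank_padded, r * per_rank_padded + bs) % n_items
           for r in range(world_size)]

# Proposed check: drop the incomplete batch first (keep 48 items),
# then split the remaining indices contiguously across the ranks.
kept = (n_items // bs) * bs                   # 48
per_rank = kept // world_size                 # 12
proposed = [torch.arange(r * per_rank, (r + 1) * per_rank)
            for r in range(world_size)]

print(current)    # matches the current output above
print(proposed)   # matches the proposed output above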

Additional context
This came up while fixing an issue with Learner.get_preds: dl.get_idxs was still allowing items from outside the last batch to be returned even when drop_last=True, which resulted in an index error. After fixing that, this 20a_distributed test started failing, and after digging in a bit I believe the new behavior should be correct.
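
To make the get_preds side of this concrete, here is a hypothetical sketch of the index mismatch being described (none of these names are fastai internals; it only illustrates the counting): with 50 items, a batch size of 12 and drop_last=True, only 48 predictions are produced, so any index at or beyond 48 that get_idxs still returns points past the available outputs.

n_items, bs = 50, 12
n_preds = (n_items // bs) * bs            # 48 outputs actually produced with drop_last=True
idxs = list(range(n_items))               # a get_idxs that still returns all 50 indices
out_of_range = [i for i in idxs if i >= n_preds]
print(out_of_range)                       # [48, 49] -> the items that would trigger the index error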

tyoc213 commented 2 years ago

If the other fix is right, this test would pass with test_eq(list(dl1), [torch.arange(i*12, i*12+12)%50])

tmabraham commented 2 years ago

@kevinbird15 is this fixed yet? I don't think so since the test is the same and not updated... What needs to be fixed here?

kevinbird15 commented 2 years ago

I think the question is whether there is anything that should be fixed at all. I am still unsure what the expected behavior is supposed to be, since I am not using DistributedDL myself. So the first thing to answer is what the expected behavior of fastai should be in the scenario outlined above. I think @marii-moe had the opinion that the current behavior is expected, so there isn't really a bug. At the very least, the two approaches would need to be compared to see whether changing this actually improves performance, or whether there are cases where one behavior is preferred over the other.