Closed: Xiaoming-Zhao closed this PR 6 days ago.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 45.22%. Comparing base (29c0ed6) to head (db65caa). Report is 3 commits behind head on main.
Hi @Xiaoming-Zhao,
Thanks for this PR!
1.) I'm pretty sure that using `infiniteloop` is correct, and hence it does not need to be replaced (though that doesn't mean we cannot do it). In issue #144, you wrote:

> I also noticed the use of `sampler.set_epoch(epoch)`. Based on my previous experience, this is crucial to ensure randomness across epochs. However, with the current generator provided by `infiniteloop`, I am not sure whether `set_epoch` will actually affect the dataloader; I need to double-check.

I wrote a small script, uploaded on my website (https://imahnshekhzadeh.github.io/#Blog), which uses an `infiniteloop` dataloader and with which I experimented extensively. I ran it like this:
```sh
torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE test_inf_loop.py --master_addr [...] --master_port [...]
# e.g.: torchrun --nproc_per_node=2 test_inf_loop.py --master_addr [...] --master_port [...]
```
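For context, the core of such a script might look roughly like the minimal sketch below. This is my reconstruction, not the actual blog script; the toy dataset, sizes, and the exact `infiniteloop` body are assumptions. The mechanism it demonstrates is real, though: `DistributedSampler` seeds its shuffle from the value last passed to `set_epoch()` whenever a fresh iterator over the dataloader is created, and inside `infiniteloop` a fresh iterator is created on every pass of the `while` loop.

```python
# Hedged sketch of an infiniteloop-style dataloader under DDP; the
# dataset, sizes, and generator body are assumptions, not the actual
# script from the blog post.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def infiniteloop(dataloader):
    # A new iterator over `dataloader` starts on every pass of the
    # while-loop; DistributedSampler seeds its shuffle with the value
    # last passed to set_epoch() at iterator-creation time.
    while True:
        for batch in iter(dataloader):
            yield batch


dist.init_process_group("gloo")  # use "nccl" on GPU nodes
torch.manual_seed(0)  # identical toy dataset on every rank
dataset = TensorDataset(torch.randn(16, 2))
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
looper = infiniteloop(loader)

for epoch in range(2):
    # Comment out the next line to reproduce the "no shuffling" output.
    sampler.set_epoch(epoch)
    for batch_idx in range(len(loader)):
        data = next(looper)
        print(f"Rank: {dist.get_rank()}, Epoch: {epoch}, "
              f"Batch: {batch_idx}, Data: {data}")
```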
When `sampler.set_epoch(epoch)` is used, we observe:
```
Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
         [-0.1320, -0.2751]])]
Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-0.1752,  0.6990],
         [-0.2350,  0.0937]])]
```
So clearly, `sampler.set_epoch(epoch)` does its job! However, when commenting out the two lines, we see this in the output:
```
Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
         [-0.1320, -0.2751]])]
Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
         [-0.1320, -0.2751]])]
```
Clearly, no shuffling happened!
2.) About the `--standalone` flag: can you please open an issue for this, since it is unrelated to the `infiniteloop` discussion? Thanks!
Thanks for the detailed example, @ImahnShekhzadeh! This is incredibly helpful. Lessons learned.
I will close this PR for now as it seems like all required changes have been implemented in #116.
Regarding the `torchrun` command, I created a separate PR, #149.
What does this PR do?
This PR avoids the infinite generator provided by `infiniteloop` and iterates over the `dataloader` directly instead, as discussed in #144. The change follows the structure provided by `pytorch`. I tested the change locally and made sure it runs smoothly.
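For illustration, the structure this PR proposed might look like the following minimal sketch. This is my reconstruction under assumptions: the toy dataset, linear model, and SGD optimizer are placeholders, not this repository's code.

```python
# Hedged sketch of iterating the DataLoader directly (no infinite
# generator); dataset, model, and optimizer are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group("gloo")  # use "nccl" on GPU nodes
torch.manual_seed(0)
dataset = TensorDataset(torch.randn(8, 2))
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
model = DDP(torch.nn.Linear(2, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(2):
    sampler.set_epoch(epoch)   # reshuffle across ranks every epoch
    for (x,) in loader:        # plain epoch loop over the dataloader
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()  # placeholder objective
        loss.backward()
        optimizer.step()
```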
I also added the `--standalone` command-line argument to the README, without which I could not make the script run. This argument is also used in the official example for single-node usage; see the example command below.
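For reference, a single-node invocation with the flag could look like this (the script name is a placeholder, not a file in this repository):

```sh
# --standalone sets up a local rendezvous, so no --master_addr or
# --master_port is needed for single-node runs.
torchrun --standalone --nproc_per_node=2 train.py
```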
Before submitting

- Did you run the `pytest` command?
- Did you run the `pre-commit run -a` command?

Did you have fun?
Make sure you had fun coding 🙃