atong01 / conditional-flow-matching

TorchCFM: a Conditional Flow Matching library
https://arxiv.org/abs/2302.00482
MIT License

Avoid using `infiniteloop` in `train_cifar10_ddp.py` #145

Closed Xiaoming-Zhao closed 6 days ago

Xiaoming-Zhao commented 1 week ago

What does this PR do?

This PR avoids the infinite generator provided by `infiniteloop` and iterates over the dataloader directly, as discussed in #144. The change follows the structure used in the official PyTorch DDP examples.
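To illustrate the structure (only a sketch with placeholder data and a placeholder loop body, not the actual diff):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder data; the real script trains on CIFAR-10.
dataset = TensorDataset(torch.randn(256, 2))

dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

num_epochs = 2
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # ensures a different shuffle on every epoch
    for (batch,) in loader:
        ...  # forward / backward / optimizer step on the DDP-wrapped model

dist.destroy_process_group()
```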

I tested the change locally and made sure that it runs smoothly.

I also added the `--standalone` command-line argument to the README; without it, I could not get the script to run. This flag is also used in the official example for single-node usage.
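For reference, the single-node command then looks something like this (the exact script path depends on where you run it from):

```bash
torchrun --standalone --nproc_per_node=NUM_GPUS_YOU_HAVE train_cifar10_ddp.py
```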

Before submitting

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 45.22%. Comparing base (29c0ed6) to head (db65caa). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main     #145       +/-   ##
===========================================
+ Coverage   34.23%   45.22%   +10.98%
===========================================
  Files          55       12       -43
  Lines        6268     1130     -5138
===========================================
- Hits         2146      511     -1635
+ Misses       4122      619     -3503
```


ImahnShekhzadeh commented 6 days ago

Hi @Xiaoming-Zhao,

Thanks for this PR!

1.) I'm pretty sure that using `infiniteloop` is correct, and hence that it does not need to be replaced (though that doesn't mean we cannot do it). In issue #144, you wrote:

> I also noticed the use of sampler.set_epoch(epoch). Based on my previous experience, this is crucial to ensure randomness across epochs. However, with the current generator provided by infiniteloop, I am not sure whether set_epoch will actually affect the dataloader; I need to double-check.

I wrote a small script, which I uploaded to my website (https://imahnshekhzadeh.github.io/#Blog), that uses an `infiniteloop` dataloader and with which I experimented extensively. I ran it like this:

```bash
torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE test_inf_loop.py --master_addr [...] --master_port [...]
# e.g.: torchrun --nproc_per_node=2 test_inf_loop.py --master_addr [...] --master_port [...]
```
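To sketch the experiment without clicking through, it boils down to something like the following (a simplified illustration in the spirit of the blog script, not the exact code; the `infiniteloop` helper follows the common restart-the-dataloader pattern):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def infiniteloop(dataloader):
    # Restart the dataloader whenever it is exhausted. A *fresh* iterator is
    # created on every pass, and at that moment the sampler's current epoch
    # (set via set_epoch) is read, which is why re-shuffling still works.
    while True:
        for batch in dataloader:
            yield batch


def main():
    dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
    rank = dist.get_rank()

    dataset = TensorDataset(torch.randn(16, 2))
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    datalooper = infiniteloop(loader)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # comment this out to see identical batches per epoch
        for batch_idx in range(len(loader)):
            batch = next(datalooper)
            print(f"Rank: {rank}, Epoch: {epoch}, Batch: {batch_idx}, Data:\n{batch}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

At least in this sketch, set_epoch takes effect even through the generator, because a new dataloader iterator is only created once the previous pass is exhausted, i.e. after the next set_epoch call.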

When sampler.set_epoch(epoch) is used, we observe:

```
Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-0.1752,  0.6990],
        [-0.2350,  0.0937]])]
```

So clearly, sampler.set_epoch(epoch) does its job! However, when commenting out the two lines, we see this in the output:

```
Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]

Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]
```

Clearly, no re-shuffling across epochs happened: without set_epoch, the DistributedSampler reuses the same seed (and hence the same permutation) in every epoch!

2.) About the --standalone flag: Can you please open an issue for this, since it is unrelated to the infiniteloop discussion? Thanks!

Xiaoming-Zhao commented 6 days ago

Thanks for the detailed example, @ImahnShekhzadeh! This is incredibly helpful. Lessons learned.

I will close this PR for now as it seems like all required changes have been implemented in #116.

Regarding the torchrun command, I created a separate PR: #149.