AutoResearch / EEG-GAN


GAN Training script does not close everything properly #62

Closed whyhardt closed 3 months ago

whyhardt commented 8 months ago

Running GAN training several times in a row seems to deplete some resources. Exact steps were:

  1. Running DDP Training on Oscar for a few epochs
  2. KeyboardInterrupt
  3. Re-running training
  4. Estimated time became twice as high
whyhardt commented 3 months ago

Could not reproduce on a single GPU (non-DDP training). Two approaches; check each individually:

  1. Added a try/except clause at the DDP training level to destroy the process group if an exception occurs
  2. Added a `torch.cuda.empty_cache()` call before each training run to clear the cache
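The two approaches above can be sketched roughly as follows. This is a minimal illustration, not the actual eeg-gan training code; the function names (`train_ddp`, `start_training_run`) and the `"nccl"` backend are assumptions for the sketch.

```python
import torch
import torch.distributed as dist


def train_ddp(rank: int, world_size: int, num_epochs: int) -> None:
    # Hypothetical DDP worker entry point (illustrative, not the real API).
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        for epoch in range(num_epochs):
            pass  # run one training epoch here
    finally:
        # Approach 1: always tear down the process group, even on an
        # exception such as KeyboardInterrupt, so worker processes and
        # their resources do not linger between runs.
        if dist.is_initialized():
            dist.destroy_process_group()


def start_training_run() -> None:
    # Approach 2: before launching a new run, release cached CUDA
    # allocations left over from a previous (possibly interrupted) run.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # ... then launch train_ddp, e.g. via torch.multiprocessing.spawn ...
```

Note that `empty_cache()` only releases memory held by PyTorch's caching allocator back to the driver; the `try/finally` around the process group is what prevents orphaned workers from inflating the estimated time on a re-run.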
chadcwilliams commented 3 months ago

In my most recent trainings, I also didn't see this issue emerge. I will try to reproduce it explicitly with GPUs on Oscar and on Google Colab. If I can't reproduce it, I'll close the issue.

chadcwilliams commented 3 months ago

Seems fixed now (could not reproduce the issue).