AutoResearch / EEG-GAN


GAN Training script does not close everything properly #62

Closed whyhardt closed 3 months ago

whyhardt commented 8 months ago

Running GAN training several times in a row seems to deplete some resources. Exact steps were:

  1. Running DDP Training on Oscar for a few epochs
  2. KeyboardInterrupt
  3. Re-running training
  4. Estimated time became twice as high
whyhardt commented 3 months ago

Could not reproduce on a single GPU (non-DDP training). Two approaches; check each individually:

  1. Added a try/except clause at the DDP training level to destroy the process group if an exception occurs
  2. Added a `torch.cuda.empty_cache()` call before each training run to clear the cache
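The two approaches above can be sketched roughly as follows. This is a minimal illustration, not the actual eeg-gan training code; the function names (`train_ddp`, `start_training_run`) and the `"nccl"` backend are assumptions for the sketch.

```python
import torch
import torch.distributed as dist


def train_ddp(rank: int, world_size: int, num_epochs: int) -> None:
    # Hypothetical DDP worker entry point (illustrative, not the real API).
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        for epoch in range(num_epochs):
            pass  # run one training epoch here
    finally:
        # Approach 1: always tear down the process group, even on an
        # exception such as KeyboardInterrupt, so worker processes and
        # their resources do not linger between runs.
        if dist.is_initialized():
            dist.destroy_process_group()


def start_training_run() -> None:
    # Approach 2: before launching a new run, release cached CUDA
    # allocations left over from a previous (possibly interrupted) run.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # ... then launch train_ddp, e.g. via torch.multiprocessing.spawn ...
```

Note that `empty_cache()` only releases memory held by PyTorch's caching allocator back to the driver; the `try/finally` around the process group is what prevents orphaned workers from inflating the estimated time on a re-run.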
chadcwilliams commented 3 months ago

In my most recent trainings, I also didn't see this issue emerge. I will try to reproduce it explicitly with GPUs on Oscar and on Google Colab. If I can't reproduce it, I'll close the issue.

chadcwilliams commented 3 months ago

Seems fixed now (could not reproduce the issue).