BiomedSciAI / fuse-med-ml

A python framework accelerating ML based discovery in the medical field by encouraging code reuse. Batteries included :)
Apache License 2.0
137 stars, 34 forks

Caching should only be done on rank 0 #284

Open shatz01 opened 1 year ago

shatz01 commented 1 year ago

Describe the bug: Caching a dataset breaks when it happens inside a DDP job.

Solution: Guard any caching with a check of the RANK environment variable (os.environ["RANK"]), so that only rank 0 performs the caching.
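The proposed guard could look roughly like the sketch below. It is not fuse-med-ml code: `is_rank_zero` and `cache_if_rank_zero` are hypothetical helper names, and `build_cache` stands in for whatever call actually populates the cache. It assumes the launcher (for example `torchrun`) sets `RANK`, and it treats a missing `RANK` as a single-process run.

```python
import os
from typing import Callable


def is_rank_zero() -> bool:
    """Return True for global rank 0, or when RANK is unset (non-distributed run)."""
    return int(os.environ.get("RANK", "0")) == 0


def cache_if_rank_zero(build_cache: Callable[[], None]) -> bool:
    """Run build_cache only on rank 0.

    build_cache is a hypothetical placeholder for the code that writes the
    dataset cache. Returns True if this process built the cache.
    """
    if not is_rank_zero():
        return False
    build_cache()
    return True
```

With a guard like this, the other ranks still need to wait until rank 0 has finished writing before they read the cache, for instance by calling `torch.distributed.barrier()` after the process group is initialized.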

shatz01 commented 1 year ago

An interim workaround is to first run your training script on a single GPU and, once the dataset has been cached, rerun it under DDP with reset_cache=False.
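One way to make this workaround harder to get wrong is to derive the flag from the environment, as in the sketch below. `choose_reset_cache` is a hypothetical helper, not part of fuse-med-ml, and it assumes the launcher sets `WORLD_SIZE` for multi-process runs. Its return value would be passed as `reset_cache` to whichever dataset constructor your script uses.

```python
import os


def choose_reset_cache(requested_reset: bool) -> bool:
    """Never reset the cache in a multi-process run.

    In a single-process run (WORLD_SIZE unset or 1) the caller's choice is
    honored; under DDP the cache is assumed to exist already from an earlier
    single-GPU run, so resetting is disabled.
    """
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1:
        return False
    return requested_reset
```

This only encodes the two-step procedure described above; it does not check that the cache from the single-GPU run actually exists.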