aditya-grover / climate-learn

Source code for ClimateLearn
MIT License

ShardDataset doesn't work for DDP #91

Closed prakhar6sharma closed 1 year ago

prakhar6sharma commented 1 year ago

Describe the bug ShardDataset doesn't work with DDP, but it does work with DDP_spawn. Training just hangs before the start of the first epoch.

prakhar6sharma commented 1 year ago

Update on this.

I have spent almost a week on this, and there is some deadlock occurring when the xarray data is converted to a torch.Tensor (See this). I just can't figure out a way to fix it for now.
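A deadlock at exactly this point is consistent with a well-known failure mode of fork-based DataLoader workers: if any thread in the parent holds a lock (e.g. inside an I/O library) at the moment the worker is forked, the child inherits the lock in its "held" state, but the thread that would release it is never copied, so the child blocks forever. The sketch below (stdlib only, not ClimateLearn code, and only an illustration of the general mechanism, not a claim about which specific lock xarray hits) reproduces that pattern:

```python
import multiprocessing as mp
import threading
import time

lock = threading.Lock()


def _hold_lock():
    # Background thread grabs the lock and keeps it, simulating an
    # I/O library mid-operation at fork time.
    lock.acquire()
    time.sleep(30)


def _child_work():
    # In a forked child the lock's "held" state is inherited, but the
    # thread that would release it was not copied -> blocks forever.
    lock.acquire()


def demonstrate_fork_deadlock(timeout=1.0):
    """Fork while another thread holds `lock`; return True if the child hangs."""
    threading.Thread(target=_hold_lock, daemon=True).start()
    time.sleep(0.1)  # make sure the lock is held before forking
    p = mp.get_context("fork").Process(target=_child_work)
    p.start()
    p.join(timeout=timeout)
    hung = p.is_alive()  # True: the child never got past lock.acquire()
    p.terminate()
    return hung


if __name__ == "__main__":
    print("child deadlocked:", demonstrate_fork_deadlock())
```

This also matches the observed symptoms: DDP (which forks DataLoader workers) hangs, while DDP_spawn and num_workers=0 (no fork while locks are held) do not.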

Currently DDP works on multiple GPUs only if you set num_workers in the DataLoader to 0. DDP_spawn does work on multiple GPUs with num_workers > 0, but the slight performance improvement from using multiple workers is quickly overshadowed by the performance degradation of using .spawn() for process creation.
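The spawn overhead mentioned above is easy to see without torch at all: a spawned process starts a fresh interpreter and re-imports the parent module, while a forked one just copies the parent's memory. This stdlib sketch (illustrative only, not ClimateLearn code) runs the same tiny workload under both start methods so the startup-cost difference can be timed:

```python
import multiprocessing as mp
import time


def square(x):
    return x * x


def run_pool(start_method, n=8):
    """Run a trivial map under the given start method; return (result, seconds)."""
    ctx = mp.get_context(start_method)
    t0 = time.perf_counter()
    with ctx.Pool(processes=2) as pool:
        result = pool.map(square, range(n))
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    fork_result, fork_time = run_pool("fork")
    spawn_result, spawn_time = run_pool("spawn")
    # Same answers either way; spawn typically pays a much larger
    # per-process startup cost (fresh interpreter + re-import).
    print(f"fork:  {fork_time:.3f}s, spawn: {spawn_time:.3f}s")
```

DDP_spawn pays this cost for every spawned rank, which is why the gain from num_workers > 0 gets eaten up.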

For my purposes, having zero workers with multiple GPUs works perfectly, as it also temporarily avoids #89 (since there is just one main process, which is stateful).