Closed by prakhar6sharma 1 year ago
Update on this.
I have spent almost a week on this, and there is some deadlock condition happening when converting the xarray data to a torch.tensor (see this). I just can't figure out a way to fix it for now.
Currently, DDP works with multiple GPUs only if you set num_workers in the DataLoader to 0. DDP_spawn does work with multiple GPUs and num_workers > 0, but the slight performance improvement from using multiple workers is quickly overshadowed by the performance degradation of using .spawn() for process creation.
For my purposes, having zero workers with multiple GPUs works perfectly, as it also temporarily avoids #89 (since there is just one main process, which is stateful).
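For reference, the working configuration described above (all data loading kept in the main process via num_workers=0) can be sketched roughly like this. ToyShardDataset is a hypothetical stand-in for the actual ShardDataset, since its implementation isn't shown in this thread:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyShardDataset(Dataset):
    """Hypothetical stand-in for the ShardDataset from this issue."""

    def __init__(self, n: int = 16):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, i: int) -> torch.Tensor:
        return self.data[i]


# The workaround from this thread: under DDP, set num_workers=0 so that
# all loading happens in the main process and no worker subprocesses are
# created (avoiding the deadlock seen when converting xarray data to
# tensors inside forked workers).
loader = DataLoader(ToyShardDataset(), batch_size=4, num_workers=0)

for batch in loader:
    pass  # the training step would consume `batch` here
```

The trade-off, as noted above, is giving up parallel data loading; but when the per-item work is cheap, the main process can usually keep the GPUs fed on its own.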
Describe the bug
ShardDataset doesn't work with DDP but works with DDP_spawn. The training just hangs before the start of the first epoch.