KevinMusgrave / pytorch-metric-learning

The easiest way to use deep metric learning in your application. Modular, flexible, and extensible. Written in PyTorch.
https://kevinmusgrave.github.io/pytorch-metric-learning/
MIT License

Distributed training not using all available GPUs on large dataset #700

Closed: ritamyhuang closed this issue 1 month ago

ritamyhuang commented 1 month ago

Hi,

Thank you for the great package! I am using the example notebook examples/notebooks/scRNAseq_MetricEmbedding.ipynb to train on a dataset with 100k data points. In the notebook, training is distributed across GPUs with nn.DataParallel, and there are 4 GPUs available. However, I keep getting this CUDA out-of-memory error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.33 GiB. GPU 0 has a total capacty of 14.58 GiB of which 1.01 GiB is free. Process 12957 has 13.57 GiB memory in use. Of the allocated memory 12.70 GiB is allocated by PyTorch, and 9.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Checking GPU memory usage shows that only GPU 0 is being used while the other 3 GPUs sit idle, which indicates that training is not actually being distributed across all GPUs. Could I get some help resolving this out-of-memory error? Thank you!
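For reference, this is roughly what the nn.DataParallel setup in the notebook looks like (a minimal sketch; the network and sizes are placeholders, not the notebook's actual model). Note that nn.DataParallel keeps the full input batch and the gathered outputs on the default device, which is why GPU 0 tends to carry the most memory even when the forward pass is replicated across GPUs:

```python
# Minimal sketch of the nn.DataParallel setup described above.
# The network and sizes are placeholders, not the notebook's actual model.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder embedding network.
model = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 64))

# nn.DataParallel replicates the model on every visible GPU and splits each
# input batch along dim 0; the outputs are gathered back on the default
# device (GPU 0), so GPU 0 typically uses the most memory.
model = nn.DataParallel(model).to(device)

batch = torch.randn(512, 2000).to(device)  # full batch lands on GPU 0 first
embeddings = model(batch)                  # forward pass is split across GPUs
print(embeddings.shape)                    # (512, 64), gathered on GPU 0
```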

KevinMusgrave commented 1 month ago

Sorry for the late reply. Perhaps the distributed training notebook would be helpful? https://github.com/KevinMusgrave/pytorch-metric-learning/blob/master/examples/notebooks/DistributedTripletMarginLossMNIST.ipynb
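Roughly, the approach there is one process per GPU with DistributedDataParallel, together with the library's DistributedLossWrapper so the loss sees embeddings from every process, rather than nn.DataParallel. A rough sketch of the pattern (the loss choice, model, and launch details below are illustrative, not the notebook's exact code):

```python
# Rough sketch of DistributedDataParallel training with the library's
# distributed loss wrapper. Loss choice, model, and launch details are
# illustrative, not the notebook's exact code.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

from pytorch_metric_learning import losses
from pytorch_metric_learning.utils import distributed as pml_dist


def train(rank, world_size):
    # One process per GPU; assumes MASTER_ADDR / MASTER_PORT are set.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Sequential(nn.Linear(2000, 64)).to(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])

    # The wrapper gathers embeddings and labels from all processes before
    # computing the loss, so each process contributes to the full batch.
    loss_fn = pml_dist.DistributedLossWrapper(loss=losses.TripletMarginLoss())

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # ... per-process DataLoader with a DistributedSampler, then per batch:
    # embeddings = model(data.to(rank))
    # loss = loss_fn(embeddings, labels.to(rank))
    # loss.backward(); optimizer.step(); optimizer.zero_grad()

    dist.destroy_process_group()

# Typically launched with:
# torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
```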

ritamyhuang commented 1 month ago

I found that the issue is resolved by changing the single line model = nn.DataParallel(model).to(device) in the notebook into two lines: first model = nn.DataParallel(model), then model = model.to(device).
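Written out in context, in case it helps anyone else (a small self-contained sketch; the placeholder model stands in for the notebook's embedder):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(2000, 64))  # placeholder for the notebook's model

# Before (as in the notebook): wrap and move in a single expression.
# model = nn.DataParallel(model).to(device)

# After (the change that resolved the OOM error for me): wrap first,
# then move the wrapped module to the default device.
model = nn.DataParallel(model)
model = model.to(device)
```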

Thank you for assisting!