Closed: rafathasan closed this issue 8 months ago
@rafathasan Sorry for the late response. Have you tried setting CUDA_LAUNCH_BLOCKING=1
as suggested in the error message? The error could be misleading and hiding the actual error message. It is also possible that this could have been OOM (out of memory).
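For reference, a minimal sketch of how to set this from inside a training script (this snippet is not from the thread; the common alternative is `CUDA_LAUNCH_BLOCKING=1 python train.py` on the command line). The variable must be set before the CUDA context is created:

```python
# Sketch: set CUDA_LAUNCH_BLOCKING before torch initializes CUDA, so kernel
# launches run synchronously and the failing op is reported at the real call site.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```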
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Thank you, reducing the eval batch size solved my problem.
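For anyone landing here: the fix amounts to using a smaller batch size for the validation DataLoader, since validation can run out of memory even when training fits (e.g. full-resolution inputs or extra metric buffers). A minimal sketch with dummy data standing in for the real segmentation dataset (the sizes and names are illustrative, not from the issue):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy images and masks standing in for the real dataset (illustrative only).
images = torch.randn(64, 3, 256, 256)
masks = torch.randint(0, 10, (64, 256, 256))
dataset = TensorDataset(images, masks)

train_loader = DataLoader(dataset, batch_size=16, shuffle=True)
# Smaller eval batch size: this was the change that resolved the error above.
val_loader = DataLoader(dataset, batch_size=4)
```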
Bug description
Title: "TORCH_USE_CUDA_DSA" error when using SyncBatchNorm and DDP with ModelCheckpoint in multi-GPU semantic segmentation training
Description:
I am training a semantic segmentation model using SyncBatchNorm and DDP on 1 node with 8 GPUs. Training works fine without any errors, but when I add ModelCheckpoint to the trainer's callbacks, a "TORCH_USE_CUDA_DSA" error is raised at arbitrary epochs. A sketch of the setup follows.
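A minimal sketch of the trainer configuration described above (assumed from the description, not copied from the reporter's config; the monitored metric name is an assumption):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Illustrative callback; "val_loss" is an assumed metric name.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=1,
    strategy="ddp",
    sync_batchnorm=True,        # Lightning wraps BatchNorm layers in SyncBatchNorm
    callbacks=[checkpoint_cb],  # adding this is what triggers the reported error
)
```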
Steps to reproduce:
Todo
Expected behavior:
ModelCheckpoint should save the model's state without any errors.
Actual behavior:
ModelCheckpoint saves the last model state, and a "TORCH_USE_CUDA_DSA" error is raised at arbitrary epochs when ModelCheckpoint is used.
Reproducibility:
Always
This is what I tried in order to avoid the error, without success:
yaml config
train.py
pl.Module
Error messages and logs
Environment
cc @justusschock @awaelchli