Open wlu1998 opened 1 month ago
Hi @wlu1998, have you tried running this on a single GPU? Also, can you post the output of the `nvidia-smi` command here?
Due to device limitations, running on a single GPU may result in “out of memory” issues.
For the default configs, you need a GPU with at least 80GB of memory.
Also, GraphCast only supports distributed data parallelism, so your GPU memory requirement does not reduce if you run on multiple GPUs.
I suggest trying to get this working on a single GPU for now.
You can reduce the size of the model and the memory overhead by changing some configs like `processor_layers`, `hidden_dim`, and `mesh_level`: https://github.com/NVIDIA/modulus/blob/main/examples/weather/graphcast/conf/config.yaml#L29
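For reference, a minimal sketch of such a reduction (the key names follow the linked `config.yaml`; the values below are illustrative guesses for a memory-constrained run, not tuned recommendations):

```yaml
# Hypothetical reduced settings in conf/config.yaml.
# Fewer processor layers, a smaller hidden size, and a coarser mesh all
# shrink the graph and activation memory at the cost of model capacity.
processor_layers: 8   # illustrative; lower than the default
hidden_dim: 256       # illustrative; smaller MLP/hidden width
mesh_level: 5         # illustrative; coarser icosahedral mesh
```

How far you can cut these while keeping acceptable accuracy is something you would need to validate empirically.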
My setup is 4 GPUs with 24 GB each, so I can only do multi-GPU runs, and I want to solve the "Some NCCL operations have failed or timed out" issue. Thank you for the suggestion, I will try it. I have also already tried changing the annual data to daily data.
@wlu1998 If this issue is resolved, let us know so we can close the issue.
resolved
HELP! Could you tell me how you resolved this?
Version
24.07
On which installation method(s) does this occur?
Docker
Describe the issue
When I train GraphCast on a single node with multiple GPUs, I encounter the following error message:

```
[rank3]:[E1016 03:33:45.317536384 ProcessGroupNCCL.cpp:568] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank3]:[E1016 03:33:45.319844411 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319862434 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 3] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319870659 ProcessGroupNCCL.cpp:582] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:33:45.380539738 ProcessGroupNCCL.cpp:568] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank2]:[E1016 03:33:45.382855328 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382875465 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 2] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382884031 ProcessGroupNCCL.cpp:582] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:33:45.400176821 ProcessGroupNCCL.cpp:568] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank1]:[E1016 03:33:45.401444623 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401455763 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 1] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401460371 ProcessGroupNCCL.cpp:582] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:34:00.349118707 ProcessGroupNCCL.cpp:568] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLTOALL, NumelIn=50507776, NumelOut=50507776, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E1016 03:34:00.351482167 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351502204 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 0] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351511431 ProcessGroupNCCL.cpp:582] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:52:46.582729551 ProcessGroupNCCL.cpp:1304] [PG 0 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1016 03:52:46.582797502 ProcessGroupNCCL.cpp:1342] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info.
workMetaList.size()=0
[rank0]:[F1016 04:02:46.583303966 ProcessGroupNCCL.cpp:1168] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLEMONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList.size() = 0
```

I added the following environment variables:

```shell
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
```

but they didn't solve the problem.
Minimum reproducible example
No response
Relevant log output
No response
Environment details
```shell
docker run -dit --restart always --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash
```
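A sketch of additional knobs worth trying for the timeout itself. This is not a confirmed fix: `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` is the variable named in the error message above, while `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL debugging variables for locating or ruling out a transport-level hang.

```shell
# Raise the watchdog heartbeat timeout (default 600 s, per the error message)
# so a slow first iteration does not get flagged as a hang.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600

# Verbose NCCL logging: shows which collective/transport each rank is stuck in.
export NCCL_DEBUG=INFO

# Rule out a peer-to-peer (PCIe/NVLink) transport issue by disabling P2P.
export NCCL_P2P_DISABLE=1

echo "heartbeat=${TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC}"
```

Note that the 600000 ms collective timeout in the log is the process-group default (10 minutes); in plain PyTorch it can be raised via the `timeout` argument to `torch.distributed.init_process_group`, though whether the Modulus launcher exposes that directly is something to check in its source.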