Can you post the error that you observed? Are you training on a single node or on multiple nodes? You need to be able to set all of the required environment variables outlined in the code. I recommend speaking to your system administrator if you need more detailed help.
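For reference, these are the standard torch.distributed rendezvous variables, not anything MACE-specific. A quick way to check what each worker in your job actually sees is a snippet along these lines (illustration only, not part of MACE); launch it exactly the way you launch training:

# check_env.py -- illustration only, not part of MACE.
# Launch with the same torchrun/PBS command as training and confirm every
# worker prints sensible values for the standard torch.distributed variables.
import os

for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(f"[pid {os.getpid()}] {var}={os.environ.get(var)}", flush=True)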
Thanks for your prompt reply. I am training on a single node for now. The error message is as follows:
2024-08-25 22:27:52.631 INFO: Using gradient clipping with tolerance=10.000
2024-08-25 22:27:52.631 INFO: Started training
2024-08-25 22:28:16.122 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
2024-08-25 22:41:15.636 INFO: Epoch 0: loss=2.2208, MAE_E_per_atom=35.8 meV, MAE_F=41.9 meV / A
[rank0]:[E825 22:53:11.599226728 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
[rank1]:[E825 22:53:11.599481332 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
[rank1]:[E825 22:53:11.679520115 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.679530396 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203585637 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203651114 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E825 22:53:11.203671433 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E825 22:53:11.206360040 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank1]:[E825 22:53:11.206422589 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E825 22:53:11.206441064 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E825 22:53:11.224901535 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b83bb9ebf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b83878628d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b8387869313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b838786b6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
Failures:
What GPUs are you using? You need to ask your system administrator to provide ports that are accessible for NCCL to communicate between GPUs. Did you make sure to provide the right environment variables?
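If you want to rule MACE out entirely, a small NCCL smoke test run under the same torchrun command can tell you whether the two GPUs can talk to each other at all. This is just a sketch, not MACE code; setting NCCL_DEBUG=INFO in the job script additionally makes NCCL print which interfaces and ports it picks:

# nccl_smoke_test.py -- sketch only, not part of MACE.
# Run with: torchrun --standalone --nnodes=1 --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # One tiny all-reduce: if this hangs or times out, the problem is the
    # GPU/NCCL setup on the node, not the training code.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()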
The training fails after Epoch 0.
I am training on A100 GPUs.
My training script is as follows:
#
#
cd $PBS_O_WORKDIR
export CUDA_HOME=/public/software/compiler/cuda-11.3
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
torchrun --standalone --nnodes=1 --nproc_per_node=2 /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train \
  --name="haha" \
  --foundation_model="./2023-12-03-mace-128-L1_epoch-199.model" \
  --train_file="mace_trainingset.xyz" \
  --valid_fraction=0.05 \
  --E0s="isolated" \
  --forces_weight=1000 \
  --energy_weight=100 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=2 \
  --valid_batch_size=2 \
  --max_num_epochs=200 \
  --start_swa=150 \
  --scheduler_patience=5 \
  --patience=15 \
  --eval_interval=1 \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --swa \
  --swa_forces_weight=10 \
  --error_table='PerAtomMAE' \
  --default_dtype="float64" \
  --device=cuda \
  --seed=123 \
  --restart_latest \
  --distributed \
  --save_cpu
Can you share the top of your log so I can look at how the GPUs were set up?
Also, what is your torch version?
The log file has been attached. haha_run-123.log
Thank you. Looking at the log, I don't think this has anything to do with MACE. Something happened to your GPUs that prevented communication. If you start again, does it crash again after epoch 0 or at a different epoch?
The torch version is '2.4.0+cu121'
I started again and again, and it always fails after epoch 0.
Can you try to downgrade to torch 2.3?
ok, I will have a try. Thanks very much!
Can you see if it crashes before or after writing the checkpoint to the disk?
It crashes after writing the checkpoint to the disk.
It is probably happening when the master GPU reaches the second barrier and the two need to sync. For some reason, at this point your second GPU is idle and can no longer respond, leading to a timeout.
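To illustrate the failure mode (this toy script is not MACE code): if one rank stalls, the other sits in the collective until the NCCL watchdog's timeout expires, which is the Timeout(ms)=600000 you see in your logs, and then the whole process group is torn down. The same timeout argument can be raised as a stopgap if the stall is just slow I/O.

# barrier_stall.py -- toy sketch of the failure mode described above; not MACE code.
# Run with: torchrun --standalone --nnodes=1 --nproc_per_node=2 barrier_stall.py
import datetime
import os
import time

import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    # Short timeout so the watchdog fires quickly in this toy example; the
    # NCCL default is 10 minutes, matching Timeout(ms)=600000 in the logs.
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=30))
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    dist.barrier()        # first barrier: both ranks arrive, fine
    if rank == 1:
        time.sleep(120)   # rank 1 goes idle (e.g. stuck after writing a checkpoint)
    dist.barrier()        # second barrier: rank 0 waits until the watchdog kills the job

    dist.destroy_process_group()

if __name__ == "__main__":
    main()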
Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?
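Something along these lines is enough (illustrative snippet only; the exact line may differ in your checkout):

# Illustrative only -- add near line 289 of mace/tools/train.py in your checkout.
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    print(f"My rank is: {dist.get_rank()}", flush=True)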
I modified train.py to print the rank. This time it crashes after epoch 2. The error message is as follows:
2024-08-26 08:13:24.098 INFO: Using gradient clipping with tolerance=10.000
2024-08-26 08:13:24.098 INFO: Started training
2024-08-26 08:13:46.902 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
My rank is: 1
My rank is: 0
2024-08-26 08:26:49.998 INFO: Epoch 0: loss=2.7713, MAE_E_per_atom=39.7 meV, MAE_F=44.4 meV / A
My rank is:My rank is: 10
2024-08-26 08:39:40.359 INFO: Epoch 1: loss=2.1196, MAE_E_per_atom=34.8 meV, MAE_F=38.3 meV / A
My rank is:My rank is: 10
2024-08-26 08:52:28.594 INFO: Epoch 2: loss=1.3910, MAE_E_per_atom=28.8 meV, MAE_F=33.6 meV / A
My rank is:My rank is: 10
[rank0]:[E826 08:53:23.114722578 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
[rank1]:[E826 08:53:23.114941895 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
[rank1]:[E826 08:53:23.162548244 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.162556584 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309172826 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309213729 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E826 08:53:23.309228728 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E826 08:53:23.312406692 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.312441895 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E826 08:53:23.312452746 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E826 08:53:23.336914054 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2adc2badaf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2adbf79518d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2adbf7958313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2adbf795a6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank0]:[E826 08:53:23.338327558 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2aeff73d6f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2aefc324d8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2aefc3254313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2aefc32566fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
Failures:
So what should I do next to avoid this problem?
To me this is not a MACE problem but a problem with your system. Sorry, I cannot help. You should request help from your system administrator.
Hi,
I want to train a model using multiple GPUs on our computer cluster, which uses the PBS job scheduler. Following https://github.com/ACEsuit/mace/issues/458, I commented out the _setup_distr_env(self) function in mace/tools/slurm_distributed.py, but it does not seem to work. What should I do to make it work? Thanks in advance!
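For context, as far as I can tell the function I commented out only has to put the standard torch.distributed variables into os.environ. Under torchrun on a single node those are already exported for every worker, so what I imagine is needed is something like this minimal replacement (my own sketch, not code from MACE; the variable names are the standard PyTorch ones):

# My own sketch, not code from MACE -- a minimal PBS/torchrun-friendly stand-in
# for _setup_distr_env in mace/tools/slurm_distributed.py. When the job is
# launched with torchrun, the variables are already set for every worker, so
# this only fills in single-node defaults if they are missing.
import os

def _setup_distr_env(self):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")  # any free port open on the node
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("LOCAL_RANK", "0")
    os.environ.setdefault("RANK", "0")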