ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

How to use multi-GPU training with the PBS system #567

Closed (jinfeng-data closed this 2 weeks ago)

jinfeng-data commented 2 weeks ago

Hi,

I want to train a model using multiple GPUs on our computer cluster, which uses the PBS job management system. Following https://github.com/ACEsuit/mace/issues/458, I commented out the _setup_distr_env(self) function in mace/tools/slurm_distributed.py, but it does not seem to work. What should I do to make it work? Thanks in advance!
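For context, torch.distributed's default env:// initialisation needs MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and LOCAL_RANK to be available. Below is a minimal sketch of a PBS-oriented replacement for the SLURM-specific setup; the function name, the fallback port and the use of PBS_NODEFILE are illustrative assumptions, not MACE's actual API.

import os

def setup_pbs_distr_env(default_port: str = "29500") -> None:
    # When launching with `torchrun --standalone` on one node, torchrun already
    # exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so there is nothing left to set up here.
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        return
    # Multi-node PBS job: use the first host in PBS_NODEFILE as the rendezvous
    # master and a port that is reachable between the GPU nodes. RANK, LOCAL_RANK
    # and WORLD_SIZE must still come from the launcher (e.g. one torchrun per node
    # with --nnodes/--node_rank derived from the PBS node list).
    with open(os.environ["PBS_NODEFILE"]) as nodefile:
        hosts = [line.strip() for line in nodefile if line.strip()]
    os.environ.setdefault("MASTER_ADDR", hosts[0])
    os.environ.setdefault("MASTER_PORT", default_port)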

ilyes319 commented 2 weeks ago

Can you share the error that you observed? Are you training on a single node or on multiple nodes? What you need to do is set all of the required environment variables outlined in the code. I recommend you speak to your system administrator if you need more detailed help.

jinfeng-data commented 2 weeks ago

Thanks for your prompt reply. I am training on a single node first. The error message is as follows:

2024-08-25 22:27:52.631 INFO: Using gradient clipping with tolerance=10.000
2024-08-25 22:27:52.631 INFO: Started training
2024-08-25 22:28:16.122 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
2024-08-25 22:41:15.636 INFO: Epoch 0: loss=2.2208, MAE_E_per_atom=35.8 meV, MAE_F=41.9 meV / A
[rank0]:[E825 22:53:11.599226728 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
[rank1]:[E825 22:53:11.599481332 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
[rank1]:[E825 22:53:11.679520115 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.679530396 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203585637 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203651114 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E825 22:53:11.203671433 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E825 22:53:11.206360040 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank1]:[E825 22:53:11.206422589 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E825 22:53:11.206441064 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E825 22:53:11.224901535 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b83bb9ebf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b83878628d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b8387869313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b838786b6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x2b836cefabf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7e25 (0x2b83610d7e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b8361aed34d in /lib64/libc.so.6)

[rank1]:[E825 22:53:11.225026392 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b56b9703f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b568557a8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b5685581313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b56855836fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x2b566ac12bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7e25 (0x2b565edefe25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b565f80534d in /lib64/libc.so.6)
W0825 22:53:12.892000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 190700 closing signal SIGTERM
E0825 22:53:13.172000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 190699) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-25_22:53:12
  host      : gpu9
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 190699)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 190699
============================================================
Permission denied, please try again.
Received disconnect from 10.11.100.1 port 22:2: Too many authentication failures for root
Authentication failed.

ilyes319 commented 2 weeks ago

What GPUs are you using? You need to ask your system administrator to provide ports that are accessible for NCCL to communicate between GPUs. Did you make sure to provide the right environment variables?
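For reference, a minimal sketch of the kind of NCCL-related variables one might set before the process group is created; the interface name and port below are placeholders to confirm with the system administrator, not values from this thread.

import os

# NCCL and torch.distributed read these when the process group / communicator is
# created, so set them before init_process_group runs (or export them in the
# PBS job script before torchrun is invoked).
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to see where communication stalls
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder: the network interface NCCL should use
os.environ["MASTER_PORT"] = "29500"        # placeholder: a port known to be open on the node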

jinfeng-data commented 2 weeks ago

The training fails after Epoch 0.

jinfeng-data commented 2 weeks ago

I am training on A100.

jinfeng-data commented 2 weeks ago

My training script is as follows:

#!/bin/bash -x
#PBS -N mace
#PBS -l nodes=1:ppn=2:gpus=2
#PBS -j oe
#PBS -q gpu_a100

#
# define variables
#

cd $PBS_O_WORKDIR

export CUDA_HOME=/public/software/compiler/cuda-11.3
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

torchrun --standalone --nnodes=1 --nproc_per_node=2 /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train \
    --name="haha" \
    --foundation_model="./2023-12-03-mace-128-L1_epoch-199.model" \
    --train_file="mace_trainingset.xyz" \
    --valid_fraction=0.05 \
    --E0s="isolated" \
    --forces_weight=1000 \
    --energy_weight=100 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --valid_batch_size=2 \
    --max_num_epochs=200 \
    --start_swa=150 \
    --scheduler_patience=5 \
    --patience=15 \
    --eval_interval=1 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --swa \
    --swa_forces_weight=10 \
    --error_table='PerAtomMAE' \
    --default_dtype="float64" \
    --device=cuda \
    --seed=123 \
    --restart_latest \
    --distributed \
    --save_cpu

ilyes319 commented 2 weeks ago

Can you share the top of your log so I can look at how the GPUs were set up?

ilyes319 commented 2 weeks ago

Also, what is your torch version?

jinfeng-data commented 2 weeks ago

The log file has been attached. haha_run-123.log

ilyes319 commented 2 weeks ago

Thank you. Looking at the log, I don't think this has anything to do with MACE. Something happened to your GPUs that prevented communication. If you start again, does it crash again after epoch 0 or at a different epoch?

jinfeng-data commented 2 weeks ago

The torch version is '2.4.0+cu121'

jinfeng-data commented 2 weeks ago

I started again and again, and it always fails after epoch 0.

ilyes319 commented 2 weeks ago

Can you try to downgrade to torch 2.3?

jinfeng-data commented 2 weeks ago

OK, I will give it a try. Thanks very much!

ilyes319 commented 2 weeks ago

Can you see if it crashes before or after writing the checkpoint to the disk?

jinfeng-data commented 2 weeks ago

It crashes after writing the checkpoint to the disk.

ilyes319 commented 2 weeks ago

It is probably happening when the master GPU reaches the second barrier and the two need to sync. For some reason your second GPU is idle at this point and cannot respond anymore, leading to a timeout.
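To illustrate the failure mode being described (an illustrative sketch, not MACE's actual training loop): rank 0 writes the checkpoint while the other rank waits at a collective; if that rank has gone idle, the collective expires after NCCL's default 10-minute window, which matches the Timeout(ms)=600000 in the logs above.

from datetime import timedelta

import torch
import torch.distributed as dist

def save_checkpoint_and_sync(model: torch.nn.Module, path: str, rank: int) -> None:
    # Only the master rank writes the file; every rank must then reach the
    # barrier within the NCCL timeout, otherwise the watchdog aborts the job.
    if rank == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()

# For debugging, the window can be widened when the process group is created
# (NCCL's default is 10 minutes, i.e. the Timeout(ms)=600000 seen in the log):
# dist.init_process_group(backend="nccl", timeout=timedelta(minutes=60))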

ilyes319 commented 2 weeks ago

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?
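For anyone following along, the requested diagnostic can be a single print such as the sketch below (the exact variable holding the rank at that point in train.py may differ); flush=True also keeps the two ranks' lines from interleaving the way they do in the log that follows.

import torch.distributed as dist

# Around the point in train.py where the epoch finishes (variable names may differ):
rank = dist.get_rank() if dist.is_initialized() else 0
print(f"My rank is: {rank}", flush=True)  # flush so the two ranks' output does not interleave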

jinfeng-data commented 2 weeks ago

I modified train.py and printed the rank. This time it crashes after epoch 2. The error message is the following:

2024-08-26 08:13:24.098 INFO: Using gradient clipping with tolerance=10.000
2024-08-26 08:13:24.098 INFO: Started training
2024-08-26 08:13:46.902 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
My rank is: 1
My rank is: 0
2024-08-26 08:26:49.998 INFO: Epoch 0: loss=2.7713, MAE_E_per_atom=39.7 meV, MAE_F=44.4 meV / A
My rank is:My rank is: 10
2024-08-26 08:39:40.359 INFO: Epoch 1: loss=2.1196, MAE_E_per_atom=34.8 meV, MAE_F=38.3 meV / A
My rank is:My rank is: 10
2024-08-26 08:52:28.594 INFO: Epoch 2: loss=1.3910, MAE_E_per_atom=28.8 meV, MAE_F=33.6 meV / A
My rank is:My rank is: 10

[rank0]:[E826 08:53:23.114722578 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
[rank1]:[E826 08:53:23.114941895 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
[rank1]:[E826 08:53:23.162548244 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.162556584 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309172826 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309213729 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E826 08:53:23.309228728 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E826 08:53:23.312406692 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.312441895 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E826 08:53:23.312452746 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E826 08:53:23.336914054 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2adc2badaf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2adbf79518d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2adbf7958313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2adbf795a6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x2adbdcfe9bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7e25 (0x2adbd11c6e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2adbd1bdc34d in /lib64/libc.so.6)

[rank0]:[E826 08:53:23.338327558 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2aeff73d6f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2aefc324d8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2aefc3254313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2aefc32566fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x2aefa88e5bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7e25 (0x2aef9cac2e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2aef9d4d834d in /lib64/libc.so.6)

W0826 08:53:24.845000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 403057 closing signal SIGTERM
E0826 08:53:25.070000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 403058) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-26_08:53:24
  host      : gpu9
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 403058)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 403058
============================================================

jinfeng-data commented 2 weeks ago

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?

So what should I do next to avoid this problem?

ilyes319 commented 2 weeks ago

To me, this is not a MACE problem but a problem with your system. Sorry, I cannot help; you should request help from your system administrator.