andrew-johnson-melb opened 3 years ago
This is probably because the Slurm daemon was launched with a limited ability to lock memory. InfiniBand needs to lock a certain amount of memory for NIC communication. Running

srun bash -c "ulimit -l"

would confirm that. To fix the issue, you can set a higher limit in /etc/security/limits.conf and restart the Slurm daemon on the compute nodes.
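A sketch of that fix, assuming the compute nodes run slurmd under systemd (paths, limits, and the unit name may differ on your cluster):

```shell
# Check the locked-memory limit as seen by tasks launched through Slurm;
# a small value (e.g. 64) instead of "unlimited" confirms the problem.
srun bash -c "ulimit -l"

# /etc/security/limits.conf on each compute node:
#   *  soft  memlock  unlimited
#   *  hard  memlock  unlimited

# If slurmd is managed by systemd, limits.conf may not apply to the
# daemon itself; set the limit on the unit as well, then restart:
#   [Service]
#   LimitMEMLOCK=infinity
sudo systemctl daemon-reload
sudo systemctl restart slurmd
```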
Great, thanks @sjeaugey. That did fix the error, much appreciated!
However, now it seems to be training about 40% slower than the local version. Any suggestions?
The best would probably be to run the NCCL perf tests to see whether the performance difference comes from NCCL or something else (e.g. CPU affinity).
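For reference, a minimal way to run the NCCL perf tests is via NVIDIA's nccl-tests repository; the build options and srun/perf flags below are assumptions that depend on the cluster setup (e.g. GPUs per node):

```shell
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make  # may need NCCL_HOME/CUDA_HOME (and MPI=1 for multi-node runs)

# Sweep all-reduce sizes from 8 bytes to 256 MB inside the same Slurm
# allocation used for training; -g is the number of GPUs per process.
srun ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8
```

Comparing the reported bus bandwidth against the expected NVLink/InfiniBand numbers shows whether the 40% slowdown comes from NCCL itself or from something else (e.g. CPU affinity, input pipeline).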
Solved. Thanks!
@andrew-johnson-melb How did you solve this? It would be helpful to post your solution. I'm getting this error everywhere in my code when training a model with MirroredStrategy on A100 GPUs, while the same code works just fine on P100 cards.
System information
The distributed training run fails when training via Slurm (using srun).
The code is run inside an enroot container. Because it is launched through Slurm, the container has a number of Slurm-specific environment variables set.
So, using MirroredStrategy to distribute training fails with NCCL errors on a simple example. Note that a number of other distributed options work, as highlighted in the code.
NOTE: this code works fine outside of the Slurm environment (in the exact same container). The Slurm environment variables seem to be causing an issue with NCCL.
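One way to test that hypothesis is to scrub the SLURM_* variables from the environment before TensorFlow initializes, so its cluster auto-detection cannot pick them up. This is a diagnostic sketch, not a library API; scrub_slurm_env is a hypothetical helper:

```python
import os


def scrub_slurm_env(env=os.environ):
    """Remove SLURM_* variables before TensorFlow is imported.

    TensorFlow's cluster resolvers can auto-detect a Slurm job from
    these variables; removing them lets MirroredStrategy behave as it
    does outside the Slurm environment. Returns the removed variables
    so they can be restored or inspected.
    """
    removed = {}
    for key in [k for k in env if k.startswith("SLURM_")]:
        removed[key] = env.pop(key)
    return removed


# Call scrub_slurm_env() at the very top of the training script,
# before "import tensorflow as tf".
```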
The srun command looks like
Error
Error when using 21.07 container and tf-nightly
Function call stack: train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function