NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

misc/ibvwrap.cc:278 NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory #206

Closed: jxh314 closed this issue 3 months ago

jxh314 commented 3 months ago

Hello, when I run the following command on a single node, with SHM disabled to force the use of IB for testing, I get the following error:

[image: command and error output]

I also modified the /etc/security/limits.conf configuration file as described in https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2143/user-guide/docs/troubleshooting.html#infiniband, but that had no effect. The log, topo and graph files are as follows:

By the way, perftest runs fine. Can you help with this?

sjeaugey commented 3 months ago

Modifying /etc/security/limits.conf is the long-term solution, so that the limits are correct after the next reboot. Did you reboot? Did you check that ulimit -l was indeed showing the right values, even within mpirun?

You can check that by running mpirun <mpirun args> <some script printing ulimit -l>
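A minimal sketch of such a script, for illustration only: the file name print_memlock.sh and the function name report_memlock are hypothetical, and the script assumes bash is available on every node. It prints the soft and hard locked-memory limits that each launched process actually inherits, prefixed with the node name.

```shell
#!/bin/bash
# print_memlock.sh (hypothetical name): report the locked-memory limit
# as seen by this process. Launch it the same way as the real test, e.g.
#   mpirun <mpirun args> ./print_memlock.sh
# so that each rank reports the limit it inherits from the launcher.

report_memlock() {
    local soft hard
    soft=$(ulimit -S -l)   # soft limit; this is what plain `ulimit -l` shows
    hard=$(ulimit -H -l)   # hard limit; the ceiling the soft limit can be raised to
    printf '%s: memlock soft=%s hard=%s\n' "$(uname -n)" "$soft" "$hard"
}

report_memlock
```

If any rank prints a number rather than unlimited, the limits.conf change has probably not reached the processes started by mpirun, even if an interactive shell on the same node already shows the new value.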

jxh314 commented 3 months ago

Thanks a lot, it works! Before setting this, the result of ulimit -l was only 8192.