NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

test error: stuck when run test example #148

Open zhengmq2010 opened 1 year ago

zhengmq2010 commented 1 year ago

The program got stuck for about 1 hour, as shown below, when I ran ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8. [image] I followed the instructions in https://github.com/nvidia/nccl and found no installation problems. It is very frustrating. Do you have any idea what could cause this? Could it be a hardware issue, or are other installed packages incompatible? Thanks for your help. Here is my environment: CUDA 11.6, Ubuntu 18.04, 8 x A100.

sjeaugey commented 1 year ago

Did you check that ACS was disabled? That would be the most probable cause.

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
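
For anyone following along, here is a minimal sketch of that check. It assumes the input is the text output of lspci -vvv run as root, and that "SrcValid+" on an ACSCtl line means ACS is enabled on that device, as the troubleshooting page linked above describes. The helper name acs_enabled_lines is hypothetical and is not part of NCCL or pciutils.

```shell
# Hypothetical helper: read `lspci -vvv` output on stdin and print only
# the ACSCtl lines that report Source Validation as enabled ("SrcValid+").
# Lines such as ACSCap (capability, not control state) are ignored.
acs_enabled_lines() {
    awk '/ACSCtl:/ && /SrcValid\+/'
}

# Example invocation (needs root, otherwise lspci cannot read the
# capability registers and prints <access denied> instead):
#   sudo lspci -vvv | acs_enabled_lines
```

No output would suggest that no device reports ACS as enabled. Any printed line points to a device worth looking at.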

zhengmq2010 commented 1 year ago

Thanks for the reply. I ran lspci -vvv | grep ACSCtl, and it returned lspci: Unable to load libkmod resources: error -12. I ran sudo lspci | grep PLX, and it returned nothing. Then I ran sudo lspci | grep PLX, and it returned something like this: [image]

sjeaugey commented 1 year ago

The output doesn't appear to come from sudo lspci | grep PLX. It was not run as root (as shown by <access denied>) and it did not grep for PLX.

This looks like the output of lspci -vvv run as normal user. Can you run it as root?

zhengmq2010 commented 1 year ago

This is weird, since I ran all three commands as root.