Open zhengmq2010 opened 1 year ago
Did you check that ACS was disabled? That would be the most probable cause.
Thanks for the reply.
I ran `lspci -vvv | grep ACSCtl`, and it returned `lspci: Unable to load libkmod resources: error -12`.
I ran `sudo lspci | grep PLX`, and it returned nothing.
Then I ran `sudo lspci | grep PLX`, and it returned something like this:
The output doesn't look related to `sudo lspci | grep PLX`. It was not run as root (as shown by `<access denied>`), and it was not filtered for PLX.
This looks like the output of `lspci -vvv` run as a normal user. Can you run it as root?
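For anyone repeating this check, here is a rough sketch of how the `lspci -vvv` output could be scanned in one go. It is only an illustration: the helper name `summarize_acs` is made up for this example, and it assumes that a `+` after any flag on an `ACSCtl:` line means that ACS feature is enabled and that `<access denied>` in the output means lspci could not read the capabilities (typically because it was not run as root).

```python
import re
import subprocess

# Matches the ACS control line that `lspci -vvv` prints for ACS-capable devices.
ACSCTL_RE = re.compile(r"^\s*ACSCtl:\s*(.*)$")


def summarize_acs(lspci_text):
    """Hypothetical helper: return (enabled_lines, access_denied).

    enabled_lines: ACSCtl lines where at least one flag ends in '+'.
    access_denied: True if lspci could not read device capabilities.
    """
    enabled = []
    for line in lspci_text.splitlines():
        match = ACSCTL_RE.match(line)
        # Assumption: a '+' after a flag name (e.g. SrcValid+) means it is on.
        if match and "+" in match.group(1):
            enabled.append(line.strip())
    access_denied = "<access denied>" in lspci_text
    return enabled, access_denied


if __name__ == "__main__":
    # Run this script as root, otherwise the capability list is hidden.
    result = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True)
    enabled, denied = summarize_acs(result.stdout)
    if denied:
        print("lspci reported <access denied>; re-run as root.")
    print(f"{len(enabled)} ACSCtl line(s) with at least one feature enabled")
    for line in enabled:
        print("  " + line)
```

If the script reports zero enabled lines and no access-denied warning, ACS is probably not the cause and the hang needs to be investigated elsewhere.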
This is weird, since I ran all three commands as root.
The program got stuck as shown below for ~1h when I ran `./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8`. I followed the instructions at https://github.com/nvidia/nccl and found no installation problems. It is very frustrating. Do you have any idea what could cause the problem? Could it be a hardware issue? Or are other installed packages incompatible? Thanks for your help. Here is my environment: CUDA 11.6, Ubuntu 18.04, 8x A100.
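When a run hangs like this, it can help to re-run it with a time limit and NCCL's debug logging turned on so there is something to attach to the issue. The sketch below is just one way to do that: `NCCL_DEBUG=INFO` is the documented NCCL logging variable, but the helper names and the 300-second limit are arbitrary choices for this example.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Copy the environment and turn on NCCL's INFO-level logging."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_with_timeout(cmd, timeout_s, env=None):
    """Run cmd; return (returncode, output). returncode is None on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # The child has been killed; keep whatever it printed so far.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return None, out


if __name__ == "__main__":
    # Same command as in the report; the 300 s limit is an assumption.
    cmd = ["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "8"]
    code, output = run_with_timeout(cmd, timeout_s=300, env=build_debug_env())
    print(output)
    if code is None:
        print("Timed out; the log above shows how far NCCL got before hanging.")
        sys.exit(1)
    sys.exit(code)
```

The last few lines of the log before the timeout are usually the most useful part to post.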