NVIDIA / nccl

Optimized primitives for collective multi-GPU communication
Other
3.28k stars 829 forks source link

work request complete err: status 5 and vendor err 249 #1353

Closed 913871734 closed 3 months ago

913871734 commented 4 months ago

Hi, I have met an issue that the mission failed due to the work complete failed. the detailed log as followings, I wonder what does the err log mean? And what scenarios usually produce such error codes?

image

Yangruipis commented 4 months ago

same issue here

sjeaugey commented 4 months ago

Two main reasons:

  1. Make sure ACS is disabled
  2. If you are using a Dell server, make sure you have the latest FW updates.
913871734 commented 4 months ago

Two main reasons:

  1. Make sure ACS is disabled
  2. If you are using a Dell server, make sure you have the latest FW updates.

What does the firmware check you mentioned mainly focus on?

sjeaugey commented 4 months ago

What does the firmware check you mentioned mainly focus on?

You should reach out to Dell for that, they should be able to assist you.

913871734 commented 4 months ago

What does the firmware check you mentioned mainly focus on?

You should reach out to Dell for that, they should be able to assist you.

I don't know what element's firmware to be checked, the nccl firmware version? or the vendor platform firmware version? or rdma-core version?

kiskra-nvidia commented 4 months ago

The vendor platform firmware (BIOS/UEFI/whatever you want to call it). There is some general info at https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html, specifically the subsection on ACS.

913871734 commented 4 months ago

vendor platform firmware

ok, thanks a lot