Closed 913871734 closed 3 months ago
same issue here
Two main reasons:
- Make sure ACS is disabled
- If you are using a Dell server, make sure you have the latest FW updates.
What does the firmware check you mentioned mainly focus on?
You should reach out to Dell for that; they should be able to assist you.
I don't know which component's firmware should be checked: the NCCL firmware version, the vendor platform firmware version, or the rdma-core version?
The vendor platform firmware (BIOS/UEFI/whatever you want to call it). There is some general info at https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html, specifically the subsection on ACS.
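For the ACS part, here is a rough Python sketch of the check that troubleshooting page describes: look at the `ACSCtl` lines in `lspci -vvv` output and treat `SrcValid+` as a sign that ACS may be enabled on that device. The helper names `acs_enabled_devices` and `scan_host` are made up for illustration, and the parsing assumes the usual `lspci` layout (device header at column 0, capability lines indented), so take it as a starting point rather than a definitive tool.

```python
import subprocess


def acs_enabled_devices(lspci_text):
    """Return PCI addresses whose ACSCtl line contains 'SrcValid+'.

    Assumption: device header lines start at column 0 and capability
    lines are indented, as in typical `lspci -vvv` output.
    """
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # New device block; the first token is the PCI address.
            current = line.split()[0]
        elif "ACSCtl:" in line and "SrcValid+" in line and current:
            flagged.append(current)
    return flagged


def scan_host():
    """Run `lspci -vvv` on this machine and report devices that may have ACS enabled."""
    result = subprocess.run(
        ["lspci", "-vvv"], capture_output=True, text=True, check=False
    )
    return acs_enabled_devices(result.stdout)
```

Calling `scan_host()` usually needs root, since `lspci` hides the capability details from unprivileged users. An empty list suggests no device reported `SrcValid+`; any addresses returned are the bridges worth looking at.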
ok, thanks a lot
Hi, I ran into an issue where the job failed because a work completion failed. The detailed log is below. What does the error log mean, and what scenarios usually produce such error codes?