NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Why is more than one iteration needed to check data? #172

Closed FarmerLiuAng closed 10 months ago

FarmerLiuAng commented 10 months ago

Hi! Commit 6c46206a478203b6453035fe0d40dc6418acd089 changed the -c option, so I would like to know why more than one iteration is needed to check data. Can't one data check detect all the errors? Looking forward to your reply!

AddyLaddy commented 10 months ago

We changed it so that nccl-tests could be used as a tool to root out faulty nodes with HW issues, either in the CPUs, GPUs, NICs or network. Occasionally it can also help identify SW issues such as a missing PCI-E flush or a misuse of the LL128 protocol, which can take multiple iterations to provoke the data corruption.

FarmerLiuAng commented 10 months ago

Thanks for your reply. Firstly, I think checking for HW issues needs only one iteration. As for SW issues, do you mean that many SW issues do not occur in every iteration? Could you please explain the two examples of SW issues mentioned above? I still don't understand why multiple iterations are needed. Thanks for your time.

AddyLaddy commented 10 months ago

I can't go into details, but we recently used nccl-tests on a 1000-node machine to isolate a HW issue using the -c X option. With -c1 it took many attempts to expose the faulty node, if it failed at all. With the new option we were able to quickly isolate the faulty node and then run lower-level tests on that node. Likewise, with the missing flush or the LL128 issue, the data corruption did not occur on every run, so multiple check iterations were required to observe the issue with nccl-tests. When we get support issues about data corruption or hangs on very large scale jobs (multiple thousands of GPUs) after multiple hours or days of execution, we can use nccl-tests to help find or eliminate HW or SW issues as a source. Data corruption issues are often very hard to track down and rarely appear on every data transfer or job execution.
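
To illustrate the point about intermittent corruption, here is a minimal sketch in Python. It is not taken from nccl-tests or from any measurement in this thread. It assumes a simplified model in which each check iteration independently exposes the fault with some probability p; the function names and the example value p = 0.01 are hypothetical.

```python
import math


def detection_probability(p: float, iterations: int) -> float:
    """Chance that at least one of `iterations` independent checks hits the fault.

    `p` is the assumed per-iteration probability that the corruption shows up.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if iterations < 0:
        raise ValueError("iterations must be non-negative")
    return 1.0 - (1.0 - p) ** iterations


def iterations_needed(p: float, target: float) -> int:
    """Smallest iteration count whose detection probability reaches `target`."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    if p == 1.0:
        # A fault that appears on every iteration is caught by the first check.
        return 1 if target > 0.0 else 0
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.01  # hypothetical: fault corrupts data in 1% of iterations
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} check iterations -> "
              f"detection probability {detection_probability(p, n):.3f}")
    print("iterations for 99% confidence:", iterations_needed(p, 0.99))
```

Under this model a single check iteration almost always misses a rare fault, while many checked iterations make detection likely. Real faults need not be independent from one iteration to the next, so the numbers are only indicative.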

FarmerLiuAng commented 10 months ago

I got it! Thanks for your reply!