Closed FarmerLiuAng closed 10 months ago
We changed it so that nccl-tests could be used as a tool to root out faulty nodes with HW issues, either in the CPUs, GPUs, NICs or network. Occasionally it can also help identify SW issues such as a missing PCI-E flush or a misuse of the LL128 protocol, which can take multiple iterations to provoke the data corruption.
Thanks for your reply. Firstly, I think checking for HW issues needs only one iteration. As for SW issues, do you mean that many SW issues do not occur in every iteration? Could you please explain the two examples of SW issues mentioned above? I still don't understand why multiple iterations are needed. Thanks for your time.
I can't go into details, but we recently used nccl-tests on a 1000-node machine to isolate a HW issue using the -c X option. With -c1 it took many attempts to expose the faulty node, if it failed at all. With the new option we were able to quickly isolate the faulty node and then run more low-level tests on that node. Again, with the missing flush or LL128 issue, the data corruptions did not occur on every run, and hence multiple check iterations were required to observe the issue with nccl-tests. When we get support issues about data corruptions or hangs on very large-scale jobs (multiple thousands of GPUs) after multiple hours or days of execution, we can use nccl-tests to help find or eliminate HW or SW issues as a source. Data corruption issues are often very hard to track down and rarely appear on every data transfer or job execution.
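To illustrate why a single check can miss an intermittent fault, here is a rough sketch of the arithmetic. It is only a toy model, not anything taken from nccl-tests: it assumes each check iteration hits the corruption independently with a fixed probability `p_fail`, and the function names and the 1% rate used below are made up for the example.

```python
import math


def detection_probability(p_fail: float, check_iters: int) -> float:
    """Chance that at least one of `check_iters` checks observes a corruption.

    Assumes (hypothetically) that every iteration fails independently
    with the same probability `p_fail`.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if check_iters < 0:
        raise ValueError("check_iters must be non-negative")
    # Probability that every iteration passes is (1 - p_fail) ** check_iters.
    return 1.0 - (1.0 - p_fail) ** check_iters


def iterations_needed(p_fail: float, target: float) -> int:
    """Smallest iteration count whose detection probability reaches `target`."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_fail and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 0.01  # illustrative rate: one corrupted result per 100 iterations
    for n in (1, 10, 100, 1000):
        print(f"{n:5d} check iterations -> P(detect) = {detection_probability(p, n):.3f}")
    print("iterations for 99% confidence:", iterations_needed(p, 0.99))
```

Under these assumptions, a fault that corrupts one transfer in a hundred is almost always missed by a single check but is very likely to be caught after a few hundred. Real faults may be correlated with message size, temperature, or traffic pattern, so treat the numbers as intuition only.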
I got it! Thanks for your reply!
Hi! Commit 6c46206a478203b6453035fe0d40dc6418acd089 changed the -c option, so I want to know why more than one iteration is needed to check data. Can't one data check detect all the errors? Looking forward to your reply!