NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Added retries to EHOSTUNREACH socket error. #1311

Open newellz2 opened 6 months ago

newellz2 commented 6 months ago

We've seen jobs fail with EHOSTUNREACH ("No route to host") when using IPoIB, even though the same jobs could be relaunched immediately afterwards. For example, a link flap caused EHOSTUNREACH, and when the job was relaunched, it started and ran successfully. I've added retries on EHOSTUNREACH to NCCL's socket connect path so that a transient loss of route no longer forces a relaunch.
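
To illustrate the idea, here is a minimal standalone sketch of a connect loop that treats EHOSTUNREACH as transient. It is not the diff from this PR and does not use NCCL's internal socket API: `isRetryableConnectError`, `connectWithRetry`, and the `maxRetries` / `delay` parameters are hypothetical names chosen for the example. Treating ECONNREFUSED and ETIMEDOUT as retryable alongside EHOSTUNREACH is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not an NCCL function): which errno values are
// treated as transient. EHOSTUNREACH is the case discussed in this PR;
// the other two are assumptions made for this sketch.
static bool isRetryableConnectError(int err) {
  return err == ECONNREFUSED || err == ETIMEDOUT || err == EHOSTUNREACH;
}

// Hypothetical retry wrapper. tryConnect follows the connect(2) convention:
// it returns 0 on success, or -1 with errno set on failure.
// Makes at most maxRetries + 1 attempts and sleeps `delay` between them.
// Returns 0 on success, or -1 with errno set to the last failure.
static int connectWithRetry(const std::function<int()>& tryConnect,
                            int maxRetries,
                            std::chrono::milliseconds delay,
                            int* attemptsOut) {
  assert(attemptsOut != nullptr);
  int attempts = 0;
  for (;;) {
    ++attempts;
    errno = 0;
    if (tryConnect() == 0) {
      *attemptsOut = attempts;
      return 0;
    }
    int err = errno;  // save before any other call can overwrite it
    if (!isRetryableConnectError(err) || attempts > maxRetries) {
      *attemptsOut = attempts;
      errno = err;
      return -1;
    }
    // Give the link (e.g. a flapping IPoIB interface) time to recover.
    std::this_thread::sleep_for(delay);
  }
}
```

In real use, `tryConnect` would wrap a call to `connect(fd, ...)` on the socket in question. A fixed delay keeps the sketch short; a capped backoff may be a better fit when link recovery time varies.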