We've seen jobs using IPoIB fail with EHOSTUNREACH even though they could be relaunched immediately. For example, a link flap caused EHOSTUNREACH, and when the job was relaunched, it started and ran successfully. I've added retries on EHOSTUNREACH ("No route to host") to NCCL so that a transient failure like this no longer requires a relaunch.
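To illustrate the idea, here is a minimal sketch of retrying a connection attempt only when it fails with EHOSTUNREACH. It is not the code in this change: `connect_with_retries`, `fake_connect`, `struct fake_link`, the retry count, and the sleep interval are all hypothetical names and values chosen for the example. The attempt is passed in as a callback so the retry logic can be exercised without a real socket; `fake_connect` simulates a link that flaps for a few attempts and then recovers.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* One connection attempt: returns 0 on success, -1 with errno set on failure. */
typedef int (*connect_fn)(void *ctx);

/* Hypothetical helper: retry only on EHOSTUNREACH, up to max_retries extra
 * attempts, sleeping sleep_ms between attempts. Any other error fails at once. */
static int connect_with_retries(connect_fn try_connect, void *ctx,
                                int max_retries, unsigned sleep_ms) {
  for (int attempt = 0;; attempt++) {
    if (try_connect(ctx) == 0) return 0;

    int saved = errno; /* capture before any call that may overwrite errno */
    if (saved != EHOSTUNREACH || attempt >= max_retries) {
      errno = saved;
      return -1;
    }

    fprintf(stderr, "no route to host, retrying (%d/%d)\n",
            attempt + 1, max_retries);
    struct timespec ts = { (time_t)(sleep_ms / 1000),
                           (long)(sleep_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
  }
}

/* Simulated flapping link, for illustration only. */
struct fake_link {
  int failures_left; /* attempts that will still fail with EHOSTUNREACH */
  int calls;         /* total attempts made so far */
};

static int fake_connect(void *ctx) {
  struct fake_link *link = (struct fake_link *)ctx;
  link->calls++;
  if (link->failures_left > 0) {
    link->failures_left--;
    errno = EHOSTUNREACH;
    return -1;
  }
  return 0;
}
```

In real use the callback would wrap the actual `connect()` call on the socket. Restricting the retry to EHOSTUNREACH keeps other errors failing fast, so only the transient no-route case observed with IPoIB is treated as recoverable.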