Open Puzzzle7 opened 4 months ago
Currently, when I encounter an NCCL timeout error, locating the hanging node is quite time-consuming. Does NCCL have a feature for identifying the hanging node? If not, could you suggest ideas for implementing one in NCCL?
NCCL does not have any such features, but it is something we are currently investigating.