Description
This patch updates PyTorch's handling of the ncclResult_t type. NCCL added the ncclRemoteError code in v2.13.4, but PyTorch's code was not updated to recognize it. As a result, PyTorch raises an "Unconvertible NCCL type" error when a remote worker terminates or crashes.
The patch also removes 'AutoNcclGroup nccl_group_guard' so that 'TORCH_NCCL_ASYNC_ERROR_HANDLING = 2 (CleanUpOnly)' can work. With nccl_group_guard in place, a new exception is thrown while another exception is still propagating, which terminates the process in C++. Removing the guard makes CleanUpOnly possible.
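The conversion problem can be illustrated with a minimal, self-contained sketch. It does not use the real NCCL or PyTorch headers: `NativeResult`, `WrappedResult`, `convert_before_patch`, and `convert_after_patch` are hypothetical names chosen for illustration, and the enumerator list is an assumption about the general shape of the result type, not a copy of it.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the native NCCL result enum (NOT the real ncclResult_t).
enum class NativeResult {
    Success,
    UnhandledCudaError,
    SystemError,
    InternalError,
    InvalidArgument,
    InvalidUsage,
    RemoteError,  // the code that newer NCCL releases can return
};

// Hypothetical wrapper enum of the kind a framework might expose to callers.
enum class WrappedResult {
    Success,
    UnhandledCudaError,
    SystemError,
    InternalError,
    InvalidArgument,
    InvalidUsage,
    RemoteError,
};

// Before the patch: RemoteError has no case, so it reaches the default
// branch and surfaces as a conversion failure instead of a remote error.
WrappedResult convert_before_patch(NativeResult r) {
    switch (r) {
        case NativeResult::Success:            return WrappedResult::Success;
        case NativeResult::UnhandledCudaError: return WrappedResult::UnhandledCudaError;
        case NativeResult::SystemError:        return WrappedResult::SystemError;
        case NativeResult::InternalError:      return WrappedResult::InternalError;
        case NativeResult::InvalidArgument:    return WrappedResult::InvalidArgument;
        case NativeResult::InvalidUsage:       return WrappedResult::InvalidUsage;
        default:
            throw std::runtime_error("Unconvertible NCCL type");
    }
}

// After the patch: the new code is mapped explicitly; everything else
// is handled exactly as before.
WrappedResult convert_after_patch(NativeResult r) {
    if (r == NativeResult::RemoteError) {
        return WrappedResult::RemoteError;
    }
    return convert_before_patch(r);
}
```

In this sketch, calling `convert_before_patch(NativeResult::RemoteError)` throws, which mirrors the symptom described above, while `convert_after_patch` returns a value the caller can handle.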
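The termination issue comes from a general C++ rule: if a destructor throws while the stack is already unwinding because of another exception, the runtime calls std::terminate. The sketch below is a rough illustration under stated assumptions, not the actual AutoNcclGroup code. `FailingGuard` and both helper functions are hypothetical, and the guard's destructor is assumed to perform a call that can fail. To stay runnable, it records the dangerous case with `std::uncaught_exceptions()` (C++17) instead of actually throwing during unwinding.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical RAII guard whose destructor performs a call that can fail
// (an assumption made for illustration only).
struct FailingGuard {
    bool& would_terminate;
    int exceptions_at_entry;

    explicit FailingGuard(bool& flag)
        : would_terminate(flag),
          exceptions_at_entry(std::uncaught_exceptions()) {}

    ~FailingGuard() noexcept(false) {
        if (std::uncaught_exceptions() > exceptions_at_entry) {
            // Another exception is propagating through this scope. Throwing
            // here would call std::terminate, so the sketch only records it.
            would_terminate = true;
            return;
        }
        // No exception in flight: throwing from the destructor is catchable.
        throw std::runtime_error("guard cleanup failed");
    }
};

// The guarded body throws, so the guard is destroyed during stack unwinding.
// Returns true if a throwing destructor would have terminated the process.
bool body_throws_inside_guard() {
    bool would_terminate = false;
    try {
        FailingGuard guard(would_terminate);
        throw std::runtime_error("collective failed");
    } catch (const std::runtime_error&) {
        // Only the original error reaches this handler.
    }
    return would_terminate;
}

// The guarded body succeeds, so the destructor's own exception is the only
// one in flight and can be caught normally.
std::string body_succeeds_inside_guard() {
    bool would_terminate = false;
    try {
        FailingGuard guard(would_terminate);
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "no error";
}
```

Here `body_throws_inside_guard()` reports that the guard was destroyed while another exception was propagating, which is the situation in which a throwing destructor ends the process before any cleanup handler can run.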
The details on this patch are also documented.
Type of Change
Checklist