cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

misc+doc: patch file for pytorch's nccl support #11

Closed myungjin closed 4 months ago

myungjin commented 4 months ago

Description

This patch file updates PyTorch's handling of the ncclResult_t type. NCCL has included the ncclRemoteError code since v2.13.4, but PyTorch's code has not been updated to handle it. As a result, the "Unconvertible NCCL type" error is raised when a remote worker terminates or crashes.
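The failure mode described above can be illustrated with a minimal sketch. It assumes the conversion is a switch over the result enum that throws on any code it does not recognize. The enum and function names here (`LibResult`, `WrapperResult`, `convert_unpatched`, `convert_patched`) are hypothetical stand-ins, not PyTorch's or NCCL's actual identifiers, and the enum is reduced to three values for brevity.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the library-side result enum (reduced set).
enum class LibResult { Success, SystemError, RemoteError };

// Hypothetical stand-in for the wrapper-side enum that mirrors it.
enum class WrapperResult { Success, SystemError, RemoteError };

// Conversion written before RemoteError existed: the new code is not
// listed, so it falls through to the default branch and throws.
inline WrapperResult convert_unpatched(LibResult r) {
  switch (r) {
    case LibResult::Success:
      return WrapperResult::Success;
    case LibResult::SystemError:
      return WrapperResult::SystemError;
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}

// Conversion after adding a case for the remote-error code.
inline WrapperResult convert_patched(LibResult r) {
  switch (r) {
    case LibResult::Success:
      return WrapperResult::Success;
    case LibResult::SystemError:
      return WrapperResult::SystemError;
    case LibResult::RemoteError:
      return WrapperResult::RemoteError;
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}
```

With this shape, a remote-error code reaches the caller as a generic conversion failure in the unpatched version, while the patched version passes it through as a distinct result that the caller can handle.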

Also, 'AutoNcclGroup nccl_group_guard' is removed so that 'TORCH_NCCL_ASYNC_ERROR_HANDLING = 2 (CleanUpOnly)' works. With nccl_group_guard in place, a new exception is thrown while another exception is already propagating, which terminates the process in C++. Removing the guard makes CleanUpOnly possible.
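The hazard behind this change can be sketched as follows, assuming the guard's destructor is allowed to throw. `ThrowingGuard` and `run_with_guard` are hypothetical names for illustration only. Note that the patch removes the guard outright; this sketch instead detects stack unwinding with `std::uncaught_exceptions()` (C++17) and skips the second throw, so the hazard can be shown without terminating the process.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical scope guard whose cleanup step may fail.
class ThrowingGuard {
 public:
  ThrowingGuard(bool cleanup_fails, bool& suppressed)
      : cleanup_fails_(cleanup_fails),
        suppressed_(suppressed),
        exceptions_at_entry_(std::uncaught_exceptions()) {}

  ~ThrowingGuard() noexcept(false) {
    if (!cleanup_fails_) return;
    if (std::uncaught_exceptions() > exceptions_at_entry_) {
      // Another exception is already propagating. Throwing from this
      // destructor now would make the runtime call std::terminate.
      suppressed_ = true;
      return;
    }
    throw std::runtime_error("cleanup failed");
  }

 private:
  bool cleanup_fails_;
  bool& suppressed_;
  int exceptions_at_entry_;
};

// Runs a body under the guard and reports which error reached the caller.
inline std::string run_with_guard(bool body_fails, bool cleanup_fails,
                                  bool& suppressed) {
  try {
    ThrowingGuard guard(cleanup_fails, suppressed);
    if (body_fails) throw std::runtime_error("remote worker failed");
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "ok";
}
```

If the destructor threw unconditionally, the case where both the body and the cleanup fail would end the process before any handler could run, which matches the behavior described above.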

The details on this patch are also documented.
