cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

temp fix: disable destroy_process_group #5

Closed myungjin closed 4 months ago

myungjin commented 4 months ago

Description

When a broken world (process group) is detected, attempting to destroy that process group causes the program to hang. This change temporarily disables the destroy call to avoid the hang. We will revisit this later.
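
A minimal sketch of the guard described above, assuming the caller already knows whether a world is broken. The helper name `cleanup_world`, its parameters, and the injected `destroy_fn` callable are hypothetical and are not taken from the pymultiworld code base. They only illustrate skipping the destroy call for a broken world.

```python
from typing import Callable


def cleanup_world(world_name: str, is_broken: bool,
                  destroy_fn: Callable[[], None]) -> bool:
    """Tear down a world unless it is known to be broken.

    Hypothetical helper for illustration. Returns True if destroy_fn
    was called and False if the call was skipped.
    """
    if is_broken:
        # Temporary workaround: destroying a broken process group was
        # observed to hang, so the call is skipped for now.
        return False
    destroy_fn()
    return True
```

In real code, `destroy_fn` could wrap `torch.distributed.destroy_process_group` for the relevant group, for example via `functools.partial`. Skipping the call means resources tied to the broken group are not released explicitly, which is presumably why the fix is marked temporary.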
