cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

feat: boolean function to check if world is broken #66

Closed myungjin closed 2 months ago

myungjin commented 2 months ago

Description

A broken world exception is raised when a communication operation such as send or recv is called on a broken world. However, if none of those operations is called, no runtime error is raised, and the broken world goes unnoticed until the next such call. This PR introduces a boolean function that checks whether a world is broken without calling any of those operations.
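The idea can be sketched roughly as follows. This is a minimal illustration and not the pymultiworld implementation: `WorldRegistry`, `mark_broken`, `is_broken`, and the `BrokenWorldException` defined here are hypothetical names chosen for the example, and `send` is a stand-in that does no real communication. It only shows how a boolean query lets a caller detect a broken world before any operation is attempted.

```python
import threading


class BrokenWorldException(Exception):
    """Hypothetical stand-in for the error raised on a broken world."""


class WorldRegistry:
    """Hypothetical per-world health tracker (not the pymultiworld API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._broken = {}  # world name -> True if the world is broken

    def add_world(self, name):
        with self._lock:
            self._broken[name] = False

    def mark_broken(self, name):
        # Assumption: some failure detector (e.g. a watchdog) calls this.
        with self._lock:
            self._broken[name] = True

    def is_broken(self, name):
        """Return True if the world is broken; performs no communication."""
        with self._lock:
            return self._broken[name]

    def send(self, name, payload):
        # Stand-in for a communication call: the failure surfaces only
        # when the operation is actually invoked.
        if self.is_broken(name):
            raise BrokenWorldException(f"world {name!r} is broken")
        return payload
```

With a check like this, a caller can test `registry.is_broken("world1")` and skip or clean up that world, instead of waiting for the next send or recv to raise.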

Type of Change

Checklist