cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

fix: gloo backend #81

Closed myungjin closed 1 month ago

myungjin commented 1 month ago

Description

The gloo backend doesn't work with isend/irecv (the asynchronous point-to-point ops). See https://github.com/pytorch/pytorch/issues/30723. To work around the issue, this change uses the blocking versions (send/recv) whenever the backend is gloo.
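
The workaround described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the helper names send_tensor and recv_tensor are hypothetical, and the torch.distributed module is passed in as the dist argument so the backend check is easy to see. It assumes the default process group has already been initialized.

```python
from typing import Any, Optional


def send_tensor(dist: Any, tensor: Any, dst: int) -> Optional[Any]:
    """Send `tensor` to rank `dst` (hypothetical helper, not pymultiworld API).

    `dist` is expected to be the `torch.distributed` module.
    Returns None for gloo (blocking send already finished), otherwise the
    work handle returned by isend, which the caller must wait on.
    """
    if dist.get_backend() == "gloo":
        # Blocking fallback: avoid isend on gloo.
        dist.send(tensor, dst=dst)
        return None
    return dist.isend(tensor, dst=dst)


def recv_tensor(dist: Any, tensor: Any, src: int) -> Optional[Any]:
    """Receive into `tensor` from rank `src` (hypothetical helper).

    Same contract as send_tensor: None for gloo, a work handle otherwise.
    """
    if dist.get_backend() == "gloo":
        # Blocking fallback: avoid irecv on gloo.
        dist.recv(tensor, src=src)
        return None
    return dist.irecv(tensor, src=src)
```

A caller would pass the real module, for example `work = send_tensor(torch.distributed, t, dst=1)`, and call `work.wait()` only when the returned value is not None. The cost of this approach is that gloo transfers no longer overlap with other work on the calling thread.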
