cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0
15 stars, 4 forks

feat: added multiworld all_reduce example #33

Closed: raresgaia123 closed this 2 months ago

raresgaia123 commented 2 months ago

all_reduce is called with SUM on the tensor held by each rank within a world. After the all_reduce completes, the tensor on every rank should hold the same value.
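To illustrate the expected result, here is a minimal sketch that models the semantics with plain Python lists standing in for tensors. It does not use pymultiworld or a real process group; `simulate_all_reduce_sum` and the three-rank world are hypothetical names made up for this illustration. In PyTorch itself the corresponding call is `torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)`, which reduces the tensor in place on every rank.

```python
from typing import List


def simulate_all_reduce_sum(rank_tensors: List[List[float]]) -> List[List[float]]:
    """Model all_reduce with SUM: every rank ends up with the element-wise sum.

    Hypothetical helper for illustration only; not part of pymultiworld.
    """
    if not rank_tensors:
        return []
    length = len(rank_tensors[0])
    if any(len(t) != length for t in rank_tensors):
        raise ValueError("all ranks must contribute tensors of the same shape")
    # Element-wise sum across all ranks in the world.
    total = [sum(values) for values in zip(*rank_tensors)]
    # Each rank receives its own copy of the reduced result.
    return [list(total) for _ in rank_tensors]


# Assumed setup: a world of 3 ranks, where rank r starts with a tensor of (r + 1).
world = [[float(rank + 1)] * 2 for rank in range(3)]
reduced = simulate_all_reduce_sum(world)

# After the reduction, every rank holds an identical tensor.
assert all(t == reduced[0] for t in reduced)
```

With real tensors the same check would be run per rank after the collective returns, comparing each rank's tensor against the expected sum.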

raresgaia123 commented 2 months ago

done