cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0
16 stars, 4 forks

refactor: examples error handling #64

Closed raresgaia123 closed 3 months ago

raresgaia123 commented 3 months ago

Added error handling for the examples. Updated the CCL operations to use rank 0 as the actor. Updated the logs to make the operations easier to understand.

myungjin commented 3 months ago

This PR is mostly refactoring, not a feature, so please update the commit title in git.