cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCLs) such as NCCL.
Apache License 2.0

refactor: restructuring packaging #83

Closed by myungjin 2 months ago

myungjin commented 2 months ago

Description

Currently, world_manager.py and world_communicator.py are copied into the torch package, so they become part of torch once multiworld is installed. This is inappropriate. Instead, we keep those files in the multiworld package and patch PyTorch so that the variables multiworld needs (_worlds, _World, etc.) are exposed through the __init__.py of the torch.distributed module. Accordingly, the PyTorch patch file is updated, and all other changes needed for every example to run are made.
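The pattern described above (leave the code in the external package and have the framework's package __init__ re-export the private names it needs) can be sketched without PyTorch. This is a minimal illustration, not the actual multiworld patch. The package name fakedist, its submodule fakedist.core, and the helper import_exposed are all hypothetical stand-ins. Only the names _World and _worlds come from the description.

```python
import importlib
import sys
import types

# --- Hypothetical stand-in for the patched framework package -------------
# "fakedist" and "fakedist.core" are illustrative only: they mimic private
# state living in an internal submodule, not PyTorch's real module layout.
internal = types.ModuleType("fakedist.core")


class _World:
    """Hypothetical placeholder for per-world state."""

    def __init__(self, name):
        self.name = name


internal._World = _World
internal._worlds = {}  # world name -> _World instance

pkg = types.ModuleType("fakedist")
pkg.core = internal
# The "patch": the package __init__ re-exports the private names, so an
# external package can import them instead of being copied into the framework.
pkg._World = internal._World
pkg._worlds = internal._worlds

sys.modules["fakedist"] = pkg
sys.modules["fakedist.core"] = internal


# --- External package side (the role multiworld plays) --------------------
def import_exposed(module_name, names):
    """Return the requested attributes, failing loudly if the patch is absent."""
    module = importlib.import_module(module_name)
    missing = [n for n in names if not hasattr(module, n)]
    if missing:
        raise ImportError(
            f"{module_name} does not expose {missing}; is the patch applied?"
        )
    return tuple(getattr(module, n) for n in names)


World, worlds = import_exposed("fakedist", ["_World", "_worlds"])
worlds["world1"] = World("world1")  # mutates the dict owned by fakedist.core
```

Because the re-export hands out the same dict object rather than a copy, updates made through the exposed name are visible to the framework's internals. This holds only as long as neither side later rebinds the name to a new object. The explicit check in import_exposed turns a missing patch into a clear ImportError at import time instead of an AttributeError deep inside the code.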

Type of Change

Checklist