cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

misc: make m8d-send-recv script flexible #42

Closed myungjin closed 2 months ago

myungjin commented 2 months ago

Description

This change allows rank 0 to be specified flexibly, so that a process belonging to more than one world can be a worker in one world and a leader in another.
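To illustrate the idea, here is a minimal sketch of a process holding a different rank in each world it belongs to. It is not code from this PR or from the pymultiworld API: the `WorldMembership` class, the `parse_memberships` helper, and the `world1:1,world2:0` spec format are all hypothetical, and the sketch assumes that rank 0 acts as the leader of a world.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldMembership:
    """One process's membership in a single world (hypothetical helper)."""

    world_name: str
    rank: int

    @property
    def role(self) -> str:
        # Assumption: rank 0 is the leader; every other rank is a worker.
        return "leader" if self.rank == 0 else "worker"


def parse_memberships(spec: str) -> list[WorldMembership]:
    """Parse a spec such as "world1:1,world2:0" into per-world memberships."""
    memberships = []
    for item in spec.split(","):
        name, rank = item.strip().split(":")
        memberships.append(WorldMembership(name, int(rank)))
    return memberships


if __name__ == "__main__":
    # The same process is a worker in world1 and a leader in world2.
    for m in parse_memberships("world1:1,world2:0"):
        print(f"{m.world_name}: rank {m.rank} ({m.role})")
```

Keeping the rank per world, rather than per process, is what would let one process take a different role in each world.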
