cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0
15 stars 4 forks

fix: pytorch v2.2.1 patch #28

Closed myungjin closed 3 months ago

myungjin commented 3 months ago

Description

To support multiworld, PyTorch v2.2.1 was patched previously. That patch file has two bugs: (1) in the reduce function, GroupMember[name] should be GroupMember, because GroupMember is a class, not a dictionary; (2) the barrier function takes GroupMember.WORLD as the default value for group. Since WORLD no longer exists, we pass None as the default value, and if group is None, we load a default group based on the world name.
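For illustration only, the two fixes can be sketched in plain Python without importing PyTorch. This is not the actual patch: `ProcessGroup`, `_default_groups`, `register_world`, `_get_default_group`, and the `world_name` parameter are hypothetical stand-ins, and how the real patch obtains the world name is an assumption here.

```python
from typing import Dict, Optional


class GroupMember:
    """Stand-in for a class that only holds sentinel attributes."""
    NON_GROUP_MEMBER = -100


class ProcessGroup:
    """Hypothetical stand-in for a process group tied to one world."""

    def __init__(self, world_name: str) -> None:
        self.world_name = world_name


# Hypothetical registry: world name -> default group of that world.
_default_groups: Dict[str, ProcessGroup] = {}


def register_world(world_name: str) -> ProcessGroup:
    """Create and record the default group for a world."""
    group = ProcessGroup(world_name)
    _default_groups[world_name] = group
    return group


def _get_default_group(world_name: str) -> ProcessGroup:
    """Look up the default group by world name."""
    return _default_groups[world_name]


def barrier(group: Optional[ProcessGroup] = None,
            world_name: str = "world1") -> ProcessGroup:
    """Sketch of fix (2): default to None, then resolve the group lazily.

    Returns the resolved group so the behavior can be inspected; a real
    barrier would synchronize the ranks of that group instead.
    """
    if group is None:
        group = _get_default_group(world_name)
    return group


# Bug (1): a plain class cannot be subscripted like a dictionary.
try:
    GroupMember["world1"]
    subscript_failed = False
except TypeError:
    subscript_failed = True

# Attribute access on the class itself is the working form.
sentinel = GroupMember.NON_GROUP_MEMBER
```

Resolving the default at call time, rather than in the function signature, avoids binding to a single global group at import time, which is what a per-world default needs.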
