cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

Is FSDP/DDP supported? #88

Open samsja opened 1 month ago

samsja commented 1 month ago

Hey, thanks for the great project. Very excited about using it.

When doing the post-install, I noticed that some internal torch distributed code seems to be patched, and I was wondering what the scope of these modifications is. Specifically, I see some patches related to FSDP and DDP, and I was wondering whether there is a chance that pymultiworld is already compatible with PyTorch's FSDP and DDP classes?

If not, I would be curious what you think would be needed to support them.

Thanks in advance :pray:

myungjin commented 1 month ago

@samsja Thank you for your interest in our project.

The patch we apply makes the world in PyTorch's distributed package a non-singleton object. To do so, some code in FSDP and DDP needed to be refactored because it used some primitives from PyTorch's distributed package. The changes preserve the original behavior of PyTorch's FSDP and DDP when PyTorch is used without multiworld's API.
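
To illustrate the non-singleton idea, here is a minimal Python sketch. It is a toy model, not PyTorch's or pymultiworld's actual code: `World`, `WorldRegistry`, and all of their methods are hypothetical names invented for this example. It only shows how code that never names a world can keep using a default one while other worlds are created and removed alongside it.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy stand-in for per-world distributed state (hypothetical class)."""
    name: str
    groups: dict = field(default_factory=dict)  # group name -> list of ranks

    def new_group(self, group_name, ranks):
        # Record a process group that belongs to this world only.
        self.groups[group_name] = list(ranks)
        return self.groups[group_name]


class WorldRegistry:
    """Holds several worlds instead of one module-level singleton (hypothetical)."""
    DEFAULT = "default"

    def __init__(self):
        self._worlds = {self.DEFAULT: World(self.DEFAULT)}

    def get(self, name=None):
        # Callers that never name a world get the default one,
        # so single-world code behaves as it did before.
        return self._worlds[name or self.DEFAULT]

    def create(self, name):
        if name in self._worlds:
            raise ValueError(f"world {name!r} already exists")
        self._worlds[name] = World(name)
        return self._worlds[name]

    def remove(self, name):
        # Dropping one world leaves the state of the others untouched.
        self._worlds.pop(name)

    def names(self):
        return sorted(self._worlds)


registry = WorldRegistry()
registry.get().new_group("dp", [0, 1])   # legacy path: default world
extra = registry.create("world1")        # second, independent world
extra.new_group("dp", [0, 2])
registry.remove("world1")                # default world is unaffected
```

In this sketch, the same group name (`"dp"`) can exist in two worlds without conflict, and removing `world1` does not disturb the default world's groups.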

We developed multiworld for inference, not for training. Our library supports most CCL primitives, but we would still need to test and investigate before we can support FSDP and DDP. Right now, they are not a priority for us.

samsja commented 1 month ago

Thanks @myungjin for the response!