facebookresearch / moolib

A library for distributed ML training with PyTorch
MIT License

RPC update; update TensorPipe, enable infiniband, various rpc-related updates & fixes #26

Closed (tscmoo closed 2 years ago)

tscmoo commented 2 years ago

This brings TensorPipe up to date with the latest version. InfiniBand is now enabled by default. All of the code for handling CUDA tensors is present in the RPC code, but CUDA is still disabled by default, because CUDA tensors are not yet supported in all-reduce and a bit more testing should be done.

It's a fair bit of code, but among some fixes/changes:

heiner commented 2 years ago

I'd suggest upping the version for this PR.

hengyuan-hu commented 2 years ago

Hi, what's the status of this PR?

tscmoo commented 2 years ago

> Hi, what's the status of this PR?

I'd love to merge it, but I noticed a significant regression in some training jobs and haven't had time to debug it yet. I promise to look into this ASAP.

hengyuan-hu commented 2 years ago

I installed this branch, and the error from issue https://github.com/facebookresearch/moolib/issues/27 does indeed disappear. However, I also observed that this version is quite slow to run. Take your time; it is not blocking anything yet.