LaurentMazare / tch-rs

Rust bindings for the C++ API of PyTorch.

Proposal: support distributed training in Rust #392

Closed. NOBLES5E closed this issue 1 year ago.

NOBLES5E commented 3 years ago

As datasets and models grow larger, single-GPU training can become a limiting factor even for moderately sized tasks. I am thinking of adding a distributed training example for tch. To achieve this, two things need to be done:

  1. A distributed communication engine supporting Rust: I can do this with our recently open-sourced bagua, which has a Rust backend, bagua-core.
  2. Tensor hooks, so that we can schedule communication when, for example, a gradient is ready: we need to wrap VariableHooksInterface.h in torch-sys, as mentioned in https://github.com/LaurentMazare/tch-rs/issues/218. This does not seem difficult (see the sketch after this list for why these hooks matter).
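
For context, here is a minimal sketch (not a working implementation) of the fallback available without hooks: gradients are averaged only after the entire backward pass has finished. `all_reduce_mean` is a placeholder for a collective that a communication backend such as bagua-core or an NCCL binding would provide; it is not a tch-rs API. Per-tensor hooks (item 2) would let this communication start as soon as each gradient becomes ready, overlapping it with the rest of the backward computation.

```rust
use tch::{nn, Tensor};

/// Placeholder: average `grad` in place across all ranks. In practice this
/// would come from a communication backend (e.g. bagua-core or an NCCL
/// binding); it is not part of tch-rs.
fn all_reduce_mean(_grad: &mut Tensor, _world_size: i64) {
    unimplemented!("provided by the communication backend")
}

/// Synchronize gradients once `loss.backward()` has completed. Without
/// per-tensor hooks, communication can only start here, after the whole
/// backward pass, so it cannot overlap with gradient computation.
fn sync_gradients(vs: &nn::VarStore, world_size: i64) {
    for var in vs.trainable_variables() {
        let mut grad = var.grad();
        if grad.defined() {
            all_reduce_mean(&mut grad, world_size);
        }
    }
}
```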

@LaurentMazare I would appreciate it if you could find the time to comment on this and check whether the direction is right. Thanks!

LaurentMazare commented 3 years ago

Sounds like a great idea. I would suggest implementing this in a separate repo/crate to start with, as it should hopefully be independent from the main tch implementation, and we can link to it from the readme once it's ready so that it's easier to discover.

Re (2), I'm not sure it's actually that easy. The thing I'm mostly worried about is deallocating the hook functions once the variables are no longer used; it's not very clear to me how that would work. This would only be an issue for closures and not for static functions, but I doubt that hooks would be very useful if they were limited to static functions.
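
For illustration only, here is a minimal sketch of that lifetime question, assuming a hypothetical C-side hook registration that accepts a callback, a context pointer, and a destructor (nothing like this exists in torch-sys today). A Rust closure has to be boxed and handed to C++ as a raw context pointer, and the C++ side must call the destructor exactly once when the variable drops its hooks; otherwise the closure leaks or is double-freed.

```rust
use std::ffi::c_void;

// Function-pointer types a hypothetical C wrapper around
// VariableHooksInterface.h might accept (assumed, not real torch-sys types).
type HookFn = unsafe extern "C" fn(ctx: *mut c_void, grad: *mut c_void);
type DropFn = unsafe extern "C" fn(ctx: *mut c_void);

// Trampoline: recover the boxed Rust closure from the context pointer and call it.
unsafe extern "C" fn call_hook<F: FnMut(*mut c_void)>(ctx: *mut c_void, grad: *mut c_void) {
    unsafe {
        let f = &mut *(ctx as *mut F);
        f(grad);
    }
}

// Destructor the C++ side must invoke when the variable releases its hooks,
// so the boxed closure is freed exactly once.
unsafe extern "C" fn drop_hook<F>(ctx: *mut c_void) {
    unsafe { drop(Box::from_raw(ctx as *mut F)) };
}

// Box the closure and return the raw pieces a C API could store.
fn into_raw_hook<F: FnMut(*mut c_void) + 'static>(f: F) -> (HookFn, DropFn, *mut c_void) {
    let ctx = Box::into_raw(Box::new(f)) as *mut c_void;
    (call_hook::<F>, drop_hook::<F>, ctx)
}
```

This is the usual Box::into_raw / trampoline / destructor pattern; the open question is making sure the C++ side actually invokes the destructor when the variable (and hence its hook list) is destroyed.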

NOBLES5E commented 3 years ago

Great, I can start doing it (the distributed training part) in a separate repo to see if it works.

For (2), it seems that if we want to add hook support, it would be better for it to live in this repo?

John0x commented 1 year ago

I guess this is dead?

Are there any other attempts at supporting distributed/parallel training for Rust ML?

NOBLES5E commented 1 year ago

@John0x Yes, this is dead since I left my previous company, where I worked on distributed training. I would say this is a great topic to work on and would love to see someone else get interested in it.

LaurentMazare commented 1 year ago

Closing this for now as it indeed has been a while.

kevincox commented 1 year ago

Does it make sense to leave this open to track the feature even if it isn't currently planned? It would be nice to have a place to subscribe for updates.