Sounds like a great idea. I would suggest implementing this in a separate repo/crate to start with, as it will hopefully be independent from the main `tch` implementation, and we can add a link from the readme once it's ready so that it's easier to discover.
Re (2), I'm not sure it's actually that easy; the thing I'm mostly worried about is deallocating the hook functions once the variables are no longer used. It's not very clear to me how that would work. This would only be an issue for closures and not for static functions, but I doubt that hooks would be very useful with static functions only.
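One way the deallocation concern could be handled (a sketch only; nothing below exists in `tch` or `torch-sys` today) is to hand back an RAII guard from hook registration, so that dropping the guard removes and frees the closure. `Variable`, `Hook`, and `HookHandle` here are made-up stand-ins; a real implementation would have to route through `VariableHooksInterface.h` on the C++ side:

```rust
// Hypothetical sketch: ties each registered hook closure to an RAII guard so
// the closure is deregistered (and dropped) when the guard goes out of scope.
use std::sync::{Arc, Mutex};

/// A boxed hook closure: takes the incoming gradient, returns a (possibly
/// modified) gradient. f64 stands in for a gradient tensor.
type Hook = Arc<dyn Fn(f64) -> f64 + Send + Sync>;

/// Stand-in for a variable that owns its registered hooks.
struct Variable {
    hooks: Mutex<Vec<Hook>>,
}

/// RAII guard: while it is alive the hook is registered; dropping it removes
/// the closure from the variable, so the closure cannot be leaked.
struct HookHandle<'a> {
    var: &'a Variable,
    hook: Hook,
}

impl Variable {
    fn new() -> Self {
        Variable { hooks: Mutex::new(Vec::new()) }
    }

    /// Registering a hook returns a guard; deregistration happens in the
    /// guard's `Drop`, so the closure's lifetime is explicit.
    fn register_hook<F>(&self, f: F) -> HookHandle<'_>
    where
        F: Fn(f64) -> f64 + Send + Sync + 'static,
    {
        let hook: Hook = Arc::new(f);
        self.hooks.lock().unwrap().push(Arc::clone(&hook));
        HookHandle { var: self, hook }
    }

    /// Mock backward pass: thread the incoming gradient through every live hook.
    fn backward(&self, incoming_grad: f64) -> f64 {
        self.hooks
            .lock()
            .unwrap()
            .iter()
            .fold(incoming_grad, |g, h| h(g))
    }
}

impl Drop for HookHandle<'_> {
    fn drop(&mut self) {
        self.var
            .hooks
            .lock()
            .unwrap()
            .retain(|h| !Arc::ptr_eq(h, &self.hook));
    }
}

fn main() {
    let v = Variable::new();
    {
        let _scale = v.register_hook(|g| g * 2.0);
        assert_eq!(v.backward(1.0), 2.0); // hook is live while the guard exists
    }
    // Guard dropped: the closure has been removed and freed.
    assert_eq!(v.backward(1.0), 1.0);
}
```

Tying the closure's lifetime to a guard keeps ownership on the Rust side explicit: whatever the closure captures is freed deterministically when the guard goes out of scope, rather than lingering for as long as the C++ autograd graph holds a reference.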
Great, I can start doing it (the distributed training part) in a separate repo to see if it works.
For (2), it seems that if we want to add hook support, it would be better for it to live in this repo?
I guess this is dead?
Are there any other attempts at supporting distributed/parallel training for Rust ML?
@John0x Yes, this is dead since I left my previous company where I did distributed training. I would say this is a great topic to work on. Would love to see someone else get interested in this.
Closing this for now as it indeed has been a while.
Does it make sense to leave this open to track the feature even if it isn't currently planned? It would be nice to have a place to subscribe for updates.
As datasets and models grow larger, single-GPU training can become a limiting factor in many moderately sized tasks. I am thinking of adding a distributed training example for `tch`. To achieve this, there are two things to be done:

(1) …
(2) Expose `VariableHooksInterface.h` in `torch-sys`, as mentioned in https://github.com/LaurentMazare/tch-rs/issues/218. This seems to be not difficult.

@LaurentMazare I would appreciate it if you have time to comment on this and see whether the direction is right. Thanks!