vishalghor opened this issue 2 years ago
@vishalghor could you please describe your setup? Do you use any high-level ML frameworks or tools for distributed training?
@gorarakelyan I use Azure with DeepSpeed-PyTorch (fastai) for distributed training. With multiple Azure VMs I accomplish the multi-node scenario, and I use DeepSpeed to achieve faster training.
I'll also be looking at Aim with Hugging Face Accelerate over the next two weeks and modeling a tracker off of our existing trackers in the framework (see here) as I play with it :)
We make our trackers log only on the main process, which avoids this duplication.
Typically this falls to the user to handle (from what I've seen in other libraries) rather than the library doing it magically, because I may want to log which specific tensors have data on different devices, for instance. A rough sketch of the pattern is below.
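As a point of reference, here is a minimal sketch of what "log only on the main process" looks like when wired up by hand, assuming Aim's `Run` API and Accelerate's `is_main_process` attribute; the actual tracker integration may structure this differently:

```python
# Minimal sketch: create and use the Aim run only on the main process,
# so a multi-GPU launch produces one run instead of one per device.
from accelerate import Accelerator
from aim import Run

accelerator = Accelerator()

# Only the main process owns a Run object; other ranks keep None.
run = Run() if accelerator.is_main_process else None

for step in range(10):
    loss = 0.1 * step  # placeholder for a real training step
    if accelerator.is_main_process:
        run.track(loss, name="loss", step=step)
```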
@muellerzr @vishalghor sorry for the late reply. AFAIK some high-level frameworks (e.g. PL/lightning-ai) track metadata only from the rank 0 node; knock-knock has similar behavior. I was thinking about adding a default mechanism to track only from rank 0, while also providing a way to disable it and define your own logic. That way it will be easier to set up Aim for distributed training. Thoughts?
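For clarity, this is roughly what users have to write by hand today with raw `torch.distributed`; the proposal is to make the rank-0 behavior the default inside Aim, with an option to opt out and supply custom logic (the helper below is just an illustrative sketch, not an existing Aim API):

```python
# Sketch of manual rank-0-only tracking with raw torch.distributed.
import torch.distributed as dist
from aim import Run

rank = dist.get_rank() if dist.is_initialized() else 0

# Only rank 0 creates a run; other ranks skip logging entirely.
run = Run() if rank == 0 else None

def track(value, name, step):
    # Hypothetical wrapper showing the guard Aim could apply by default.
    if run is not None:
        run.track(value, name=name, step=step)
```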
@gorarakelyan that would definitely be a nice way of doing it to make others' lives easier! Just make sure it's stated somewhere that this is what's going on, as not everyone does this 😄
(And it would ease a few lines of the Accelerate code for the integration as well :) )
@muellerzr awesome! Adding to the roadmap then 🙌
btw are you going to integrate Aim with Hugging Face's Accelerate? Would be happy to help you in any way we can.
@gorarakelyan slowly working my way there, yes :) I'm running some experiments and using Aim. Haven't gotten to the Accelerate part yet (been using raw torch as a baseline), but once I have it mixed in I'll open a draft PR in the Accelerate GitHub and ping you to make sure I haven't missed anything 😄
@muellerzr sounds awesome! Please feel free to ask any question regarding Aim, as well as fire issues or feature requests. We are huge fans of Hugging Face, and I believe the integration would be beneficial for both communities. :hugs:
Opened a PR here for Accelerate, would love feedback on whether I happened to miss something important! :) https://github.com/huggingface/accelerate/pull/649
@muellerzr awesome! looking into it.
🚀 Feature
Looking at the examples and GitHub repositories which use Aim, I see that none of them use Aim in a multi-GPU or distributed training setup. The probable reason is inaccurate logging by Aim when used in a distributed setting: Aim ends up creating an event or entry with identical data for each of the GPUs instead of a single one.
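For illustration, a minimal sketch of how the duplication shows up, assuming a standard `torchrun` launch and Aim's `Run` API (the script name and launch command are made up for the example):

```python
# duplicated_runs.py -- every process executes this script, so each one
# creates its own Run; a 4-GPU launch produces 4 identical entries.
# Example launch: torchrun --nproc_per_node=4 duplicated_runs.py
from aim import Run

run = Run()  # runs once per process, not once per training job
for step in range(100):
    run.track(0.0, name="loss", step=step)  # placeholder metric
```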
Motivation
Adding a feature which enables users to monitor their distributed training will surely help grow the user community of Aim, who would use it and contribute back to the community. Having more of this support will also help with integrating Aim into cloud-based training on either AWS or Azure.