aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
5.21k stars · 320 forks

Logging for Distributed training on single and multi-node #2056

Open vishalghor opened 2 years ago

vishalghor commented 2 years ago

🚀 Feature

Looking at the examples and GitHub repositories that use Aim, I see that none of them use Aim in a multi-GPU or distributed training setup. The probable reason is inaccurate logging by Aim in a distributed setting: Aim ends up creating an event or entry with identical data for each of the GPUs instead of a single one.

Motivation

Adding a feature that enables users to monitor their distributed training will surely help grow the community of Aim users who would use it and contribute back. Having this support will also help with integrating Aim into cloud-based training on either AWS or Azure.

gorarakelyan commented 2 years ago

@vishalghor could you please describe your setup? do you use any high level ML frameworks or tools for distributed trainings?

vishalghor commented 2 years ago

@gorarakelyan I use Azure with DeepSpeed-PyTorch (fastai) for distributed training. With multiple Azure VMs I achieve a multi-node setup, and I use DeepSpeed for faster training.

muellerzr commented 2 years ago

I'll also be looking at Aim with Hugging Face Accelerate over the next two weeks, modeling a tracker off of our existing trackers in the framework (see here) as I play with it :)

We make our trackers log only on the main process, which avoids this duplication

Typically (as I've seen in other libraries) this falls to the user to handle rather than being handled magically, because I may want to log, for instance, which specific tensors have data on different devices.
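For readers who want to do this by hand today, here is a minimal sketch of the main-process-only pattern described above. `is_main_process` and `MainProcessLogger` are hypothetical names for illustration; the only real Aim call assumed is `Run.track(value, name=..., step=...)`, and the `RANK` environment variable is the one set by launchers like torchrun and DeepSpeed:

```python
import os

def is_main_process() -> bool:
    # torchrun, DeepSpeed, and similar launchers set RANK per worker;
    # by convention only rank 0 should write to the tracker.
    return int(os.environ.get("RANK", "0")) == 0

class MainProcessLogger:
    """Wraps any tracker so .track() is a no-op off the main process."""

    def __init__(self, tracker_factory):
        # tracker_factory is called only on rank 0, so worker ranks
        # never even open a run (e.g. tracker_factory=aim.Run).
        self._tracker = tracker_factory() if is_main_process() else None

    def track(self, value, name, step=None):
        if self._tracker is not None:
            # Mirrors Aim's Run.track(value, name=..., step=...) signature.
            self._tracker.track(value, name=name, step=step)
```

With this wrapper, every rank can call `logger.track(loss, "loss", step)` unconditionally and only rank 0 actually writes, which avoids the duplicate-entry problem from the original report.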

gorarakelyan commented 2 years ago

@muellerzr @vishalghor sorry for the late reply. afaik some high-level frameworks (e.g. PL / Lightning AI) track metadata only from the rank 0 node. knock-knock has similar behavior. I was thinking about adding a default mechanism to track only from rank 0, while also providing a way to disable it and define your own logic. That way it would be easier to set up Aim for distributed training. Thoughts?
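One hedged sketch of what that default-with-opt-out could look like. Everything here is hypothetical (`RankZeroRun`, `track_only_rank_zero`, `_detect_rank` are illustrative names, not existing Aim API); it only assumes the standard `RANK` / `SLURM_PROCID` environment variables set by common launchers:

```python
import os

def _detect_rank() -> int:
    # Hypothetical helper: torchrun/DeepSpeed set RANK, SLURM sets SLURM_PROCID.
    for var in ("RANK", "SLURM_PROCID"):
        if var in os.environ:
            return int(os.environ[var])
    return 0  # single-process training

class RankZeroRun:
    """Sketch of a run that drops metrics on worker ranks by default,
    with an opt-out for users who want per-rank logging."""

    def __init__(self, track_only_rank_zero: bool = True):
        self._enabled = (not track_only_rank_zero) or _detect_rank() == 0
        self.logged = []  # stand-in for Aim's actual storage

    def track(self, value, name, step=None):
        if self._enabled:
            self.logged.append((name, value, step))
```

Passing `track_only_rank_zero=False` restores per-rank logging for cases like muellerzr's, where a user deliberately wants to see which tensors live on which device.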

muellerzr commented 2 years ago

@gorarakelyan that would definitely be a nice way of doing it to make others' lives easier! Just make sure it's stated somewhere that this is what's going on, as not everyone does this 😄

(And helps ease a few lines of the Accelerate code for the integration as well :) )

gorarakelyan commented 2 years ago

@muellerzr awesome! Adding to the roadmap then 🙌

gorarakelyan commented 2 years ago

btw are you going to integrate Aim with Hugging Face's Accelerate? Would be happy to help you in any way we can.

muellerzr commented 2 years ago

@gorarakelyan slowly working my way there, yes :) I'm running some experiments and using Aim. Haven't gotten to the Accelerate part yet (been using raw torch as a baseline), but once I have it mixed in I'll open a draft PR in the Accelerate github and ping you to make sure I haven't missed anything 😄

gorarakelyan commented 2 years ago

@muellerzr sounds awesome! Please feel free to ask any question regarding Aim, as well as fire issues or feature requests. We are huge fans of Hugging Face, and I believe the integration would be beneficial for both communities. :hugs:

muellerzr commented 2 years ago

Opened a PR here for accelerate, would love feedback on if I happened to miss something important! :) https://github.com/huggingface/accelerate/pull/649

gorarakelyan commented 2 years ago

@muellerzr awesome! looking into it.