aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Multiple runs created for a single distributed training task with AIM #3148

Status: Open · opened 1 month ago by zhiyxu

zhiyxu commented 1 month ago

❓Question

When using Aim to track a distributed training task on multiple GPUs (e.g., 8 GPUs), I noticed that each GPU process creates a separate run with its own hyperparameters and metrics. As a result, a single distributed training task on 8 GPUs produces 8 runs.

However, I would expect a single run for the entire distributed training task, regardless of the number of GPUs used. Is this behavior expected, or is there a way to consolidate everything into one run per task?
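
For reference, a minimal sketch of how this arises under a standard `torchrun`-style launch (the script below is illustrative, not my actual code; it assumes the launcher sets the `RANK`/`WORLD_SIZE` environment variables): every rank executes the same script, so every rank constructs its own `aim.Run`.

```python
import os

from aim import Run

# torchrun launches one copy of this script per GPU and sets RANK/WORLD_SIZE,
# so with 8 GPUs this block executes 8 times -- creating 8 separate runs.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

run = Run(experiment="ddp-example")  # one Run per process
run["hparams"] = {"lr": 1e-3, "world_size": world_size}

for step in range(10):
    fake_loss = 1.0 / (step + 1)  # stand-in for a real training loss
    run.track(fake_loss, name="loss", step=step, context={"rank": rank})
```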

Having multiple runs for a single task makes it difficult to track and analyze the overall performance and metrics. It would be more convenient and intuitive to have a single run that aggregates the data from all GPUs involved in the distributed training process.

Please let me know whether this behavior is intended, or whether there is a configuration option or workaround to get a single run per distributed training task with Aim.
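
For completeness, the usual workaround with per-process trackers is a rank-0 guard: create the Run only on rank 0 and skip tracking on the other ranks. A minimal sketch (not official Aim guidance, again assuming a launcher that sets `RANK`):

```python
import os

from aim import Run

rank = int(os.environ.get("RANK", "0"))

# Only rank 0 creates the Run; every other rank keeps run=None and skips
# logging, so the whole multi-GPU job appears as a single run in Aim.
run = Run(experiment="ddp-example") if rank == 0 else None
if run is not None:
    run["hparams"] = {"lr": 1e-3}

for step in range(10):
    fake_loss = 1.0 / (step + 1)  # stand-in for the loss on this rank
    if run is not None:
        run.track(fake_loss, name="loss", step=step)
```

This only records rank 0's local values; to log metrics aggregated across GPUs, all-reduce them across ranks first (e.g. with `torch.distributed.all_reduce`) before calling `run.track`.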

SGevorg commented 2 weeks ago

@zhiyxu this makes sense, any chance you could share more about your setup and, if possible, scripts or other ways to reproduce this? We are currently working to fix these issues.