YuMJie opened this issue 6 days ago
Thank you for your interest in this project.
As you mentioned, memory usage varies with the micro-batch size; the files in profile_data_samples are based on a micro-batch size of 1. It appears there was an error when copying the sample data, which resulted in incorrect values. I will update them with the correct values.
While we do not have precise measurements for the exact model in the sample, we will add some reference data that may be helpful. Alternatively, you could regenerate the existing profile data on your own devices so that it reflects their approximate performance.
To measure pipeline communication costs more accurately, we recommend profiling the activation size in advance and using that value. The Metis code includes a GPT model along with code that computes its activation size, which is what we used for these calculations.
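For anyone following along, here is a minimal sketch of that kind of estimate, assuming fp16 activations and the standard shape of the tensor passed between GPT pipeline stages. The function name and arguments are illustrative, not the actual helper in the Metis code:

```python
def gpt_activation_size_bytes(micro_batch_size: int,
                              seq_length: int,
                              hidden_size: int,
                              bytes_per_elem: int = 2) -> int:
    """Bytes sent across one pipeline boundary per micro batch.

    The tensor handed between GPT pipeline stages has shape
    (micro_batch_size, seq_length, hidden_size); fp16 activations
    take 2 bytes per element.
    """
    return micro_batch_size * seq_length * hidden_size * bytes_per_elem

# Example: micro batch 1, sequence length 2048, hidden size 4096, fp16:
# 1 * 2048 * 4096 * 2 bytes = 16 MiB per micro batch per stage boundary.
print(f"{gpt_activation_size_bytes(1, 2048, 4096) / 2**20:.1f} MiB")  # 16.0 MiB
```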
Thank you for your reply, but I have some questions:
Thank you!
I got it! Thank you for your reply! I will close this issue.
Hi @mgong-kang, I noticed that you implemented heterogeneity-aware data-parallel load balancing, but I cannot find any code or output related to it. Could you provide some information? Also, could you explain how to use the Metis config to run Alpa? Thank you!
The DataLoadBalancer is implemented at the following path: https://github.com/SamsungLabs/Metis/blob/main/model/load_balancer.py#L147
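For intuition, here is a minimal sketch of what heterogeneity-aware data-parallel load balancing does: split a batch across ranks in proportion to each GPU's measured throughput, so every rank finishes its pass at roughly the same time. This is my own illustration under those assumptions, not the actual `DataLoadBalancer` API:

```python
def balance_batch(global_batch_size: int, throughputs: list[float]) -> list[int]:
    """Split a global batch across data-parallel ranks in proportion
    to each GPU's measured throughput (samples/sec), so all ranks
    take roughly the same time per step."""
    total = sum(throughputs)
    # Floor of each rank's ideal fractional share.
    shares = [int(global_batch_size * t / total) for t in throughputs]
    # Hand the leftover samples to the fastest ranks first.
    leftover = global_batch_size - sum(shares)
    fastest_first = sorted(range(len(shares)),
                           key=lambda r: throughputs[r], reverse=True)
    for rank in fastest_first[:leftover]:
        shares[rank] += 1
    return shares

# Example: an A100 roughly twice as fast as a V100 within one stage.
print(balance_batch(96, [2.0, 1.0]))  # -> [64, 32]
```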
@goeunee326 I would appreciate guidance on how to run Alpa with the results produced by Metis.
Thank you for your help, but it seems that the strategies Metis outputs do not reflect how the batch is split across different GPUs under data parallelism.
What is more, I found that some identical strategies are reported with different costs.
@YuMJie
Data parallelism occurs when heterogeneous GPUs are allocated within a stage; in that case, the stage's batch is split unevenly across those GPUs according to their performance.
If you can send the profile data you've been working with, I'll take a look.
Thank you.
"Metis: Fast Automatic Distributed Training on Heterogeneous GPUs" is excellent work. However, I have a couple of questions about the code:
Could you provide the benchmarks for Metis? Thank you!