SamsungLabs / Metis

[ATC '24] Metis: Fast automatic distributed training on heterogeneous GPUs (https://www.usenix.org/conference/atc24/presentation/um)

Benchmarks for Metis #10

Open YuMJie opened 6 days ago

YuMJie commented 6 days ago

"Metis: Fast Automatic Distributed Training on Heterogeneous GPUs" is excellent work; however, I have a couple of questions about the code:

  1. Why are the execution_memory values in the configuration files the same for different micro batch sizes?
  2. Could you provide the profile files for other devices (e.g., RTX 3090)?
  3. I see that the profile file described in the README.md contains activation memory, but this is not used in the code.

Could you provide the benchmarks for metis? Thank you!

mgong-kang commented 2 days ago

Thank you for your interest in this project.

  1. As you mentioned, memory usage varies depending on the micro batch size. The files in profile_data_samples are based on a micro batch size of 1. It appears there was an error in copying the sample data, resulting in incorrect values; I will update them with the correct values.

  2. While we do not have precise measurement data for the exact model provided in the sample, we will add some reference data that may be helpful. Alternatively, you could regenerate the existing data to reflect approximate device performance.

  3. To measure pipeline communication costs more accurately, it is recommended to profile the activation size in advance and use it. The Metis code includes a GPT model along with code that calculates its activation size, and that calculation is used for the cost estimates (a rough sketch of such an estimate is shown below).
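
For reference, here is a minimal, illustrative sketch of how the activation tensor crossing a pipeline-stage boundary can be sized for a GPT-style model. The function name and parameters are my own, not the actual Metis helper, and it assumes fp16 activations with the hidden state as the boundary tensor:

```python
def gpt_boundary_activation_bytes(micro_batch_size: int,
                                  sequence_length: int,
                                  hidden_size: int,
                                  bytes_per_element: int = 2) -> int:
    """Rough size of the activation tensor sent between two pipeline stages.

    For a GPT-style model the tensor crossing a stage boundary is the hidden
    state of shape (micro_batch_size, sequence_length, hidden_size); with
    fp16 activations each element takes 2 bytes.
    """
    return micro_batch_size * sequence_length * hidden_size * bytes_per_element


# Example: micro batch 1, sequence length 2048, hidden size 1600 (GPT-2 XL-like)
# => 1 * 2048 * 1600 * 2 bytes ≈ 6.55 MB per micro batch per stage boundary.
size_bytes = gpt_boundary_activation_bytes(1, 2048, 1600)
print(f"{size_bytes / 1e6:.2f} MB")
```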

YuMJie commented 2 days ago

Thank you for your reply, but I have some questions:

  1. I have seen the code that calculates the activation size, but I cannot find any code that uses the "activation_parameters_bytes" item from the profile file.
  2. Could you provide the code for executing the config that Metis generates?
  3. The "parameters" item in the profile file means the size of the model, but with mixed-precision training there are two copies of the model weights with different sizes. Which weight size should be written?

Thank you!

mgong-kang commented 2 days ago
  1. activation_parameter_bytes is not currently used in the code. It is a format prepared for models where calculating the activation size per model is challenging.
  2. Are you referring to the code for generating the profile? If so, there is a generation guide in the README.md, and I kindly ask for your understanding as we cannot provide the code.
  3. Since the communication cost should reflect the actual weight size of the model, it is appropriate to use the FP32 weight size when measuring communication cost, even if some weights are converted to FP16 for computational efficiency in mixed-precision training (a rough illustration is shown below).
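
As a rough illustration of the difference between the two weight copies (the parameter count below is just an example, not taken from the Metis samples):

```python
def parameter_bytes(parameter_count: int, bytes_per_param: int) -> int:
    """Total size of one copy of the model weights."""
    return parameter_count * bytes_per_param


# Example: a 1.3B-parameter model trained with mixed precision.
params = 1_300_000_000
fp16_bytes = parameter_bytes(params, 2)  # compute copy used in forward/backward
fp32_bytes = parameter_bytes(params, 4)  # master copy kept in full precision

print(f"fp16 copy: {fp16_bytes / 1e9:.2f} GB")  # ~2.60 GB
print(f"fp32 copy: {fp32_bytes / 1e9:.2f} GB")  # ~5.20 GB, the size suggested for the "parameters" field
```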

Thank you!

YuMJie commented 2 days ago

I got it! Thank you for your reply! I will close this issue.

YuMJie commented 17 hours ago

Hi @mgong-kang, I notice that you implemented heterogeneity-aware data-parallel load balancing, but I cannot find any code or output related to it. Could you provide some information? What's more, could you explain how to use the Metis config to run Alpa? Thank you!

mgong-kang commented 15 hours ago

The DataLoadBalancer is implemented at the following path: https://github.com/SamsungLabs/Metis/blob/main/model/load_balancer.py#L147
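
For intuition, the idea behind heterogeneity-aware data-parallel load balancing is to split the global batch across replicas in proportion to each GPU's measured throughput, so that all replicas finish an iteration at roughly the same time. The sketch below only illustrates that idea; it is not the actual DataLoadBalancer API:

```python
def split_batch_by_throughput(global_batch_size: int,
                              throughputs: list[float]) -> list[int]:
    """Split a global batch across data-parallel replicas in proportion to
    each replica's measured throughput (samples/sec)."""
    total = sum(throughputs)
    shares = [int(global_batch_size * t / total) for t in throughputs]
    # Hand out any rounding remainder to the fastest replicas first.
    remainder = global_batch_size - sum(shares)
    fastest_first = sorted(range(len(shares)), key=lambda i: throughputs[i], reverse=True)
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares


# Example: one fast GPU and two slower GPUs in the same stage.
print(split_batch_by_throughput(64, [2.0, 1.0, 1.0]))  # [32, 16, 16]
```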

mgong-kang commented 14 hours ago

@goeunee326 I would appreciate it if you could provide guidance on how to execute the Metis results in Alpa.

YuMJie commented 14 hours ago

> The DataLoadBalancer is implemented at the following path: https://github.com/SamsungLabs/Metis/blob/main/model/load_balancer.py#L147

Thank you for your help, but it seems that the output of the Metis strategy does not reflect the per-GPU batch sizes when data parallelism runs across different GPUs.

What's more, I found that some identical strategies have different costs.

[screenshot: identical strategies reported with different costs]

mgong-kang commented 13 hours ago

@YuMJie

  1. Data parallelism occurs when heterogeneous GPUs are allocated within a stage.

  2. If you could send the profile data you've worked on, I'll take a look.

Thank you.