FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

How can I calculate the `mm_flops` used in cost_model.py for other GPUs? #122

Closed minhopark-neubla closed 1 year ago

minhopark-neubla commented 1 year ago

https://github.com/FMInference/FlexGen/blob/d34f7b4b43ed87a374f394b0535ed685af66197b/experimental/cost_model.py#L73-L76

Hello! Thank you for sharing your great work!

I have a question. I want to use cost_model.py for other GPUs (e.g. A6000, A100, ...).

These GPUs have different peak FLOPS and GPU memory bandwidth, but in cost_model.py, `mm_flops` is just a magic number, and it does not seem to account for GPU memory bandwidth.

Is there a method for calculating `mm_flops`?

Thank you.

Ying1123 commented 1 year ago

The cost model here is a rough estimate; the real execution time can follow a more complicated pattern. As noted at the beginning of the file, we obtain those magic numbers by fitting real runs. More specifically, we collect data points (batch size, sequence length, model size, etc., together with the measured execution time) from real runs, and then use gradient descent to fit the constants (mm_flops, bmm_flops, etc.) in the cost model.
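
Here is a minimal sketch of that idea for a single constant, not the project's actual fitting code: time real matmuls on the target GPU (e.g. A6000 or A100), then fit `mm_flops` so that `predicted_time = flop_count / mm_flops` matches the measurements. The helper `measure_matmul_time` and the shapes are illustrative assumptions; FlexGen fits several constants jointly with gradient descent, but for one constant a least-squares fit has a closed form.

```python
import time
import numpy as np
import torch

def measure_matmul_time(b, m, k, n, device="cuda", repeats=10):
    """Average wall-clock time of a (b, m, k) x (b, k, n) fp16 batched matmul."""
    x = torch.randn(b, m, k, device=device, dtype=torch.float16)
    y = torch.randn(b, k, n, device=device, dtype=torch.float16)
    torch.bmm(x, y)                      # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        torch.bmm(x, y)
    torch.cuda.synchronize()
    return (time.time() - start) / repeats

# Data points from real runs on the target GPU: (batch, M, K, N).
# These shapes are made up for illustration.
shapes = [(1, 1024, 4096, 4096), (4, 512, 4096, 4096), (8, 256, 4096, 4096)]
flops = np.array([2.0 * b * m * k * n for b, m, k, n in shapes])
times = np.array([measure_matmul_time(*s) for s in shapes])

# Least-squares fit of time ≈ flops / mm_flops (linear in 1 / mm_flops).
inv_mm_flops = np.dot(flops, times) / np.dot(flops, flops)
mm_flops = 1.0 / inv_mm_flops
print(f"fitted mm_flops ≈ {mm_flops / 1e12:.1f} TFLOPS")
```

The fitted value is an effective throughput, usually below the GPU's peak spec, because it absorbs kernel launch overheads and memory effects for the measured shapes; that is why fitting real runs is preferable to plugging in datasheet numbers.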