Closed minhopark-neubla closed 1 year ago
The cost model here is a rough estimate; real execution time can follow a more complicated pattern. As written at the beginning of the file, we obtained those magic numbers by fitting against real runs. More specifically, we collect data points (batch size, sequence length, model size, etc., together with execution time) from real runs, then use gradient descent to fit the constants (mm_flops, bmm_flops, etc.) in the cost model.
https://github.com/FMInference/FlexGen/blob/d34f7b4b43ed87a374f394b0535ed685af66197b/experimental/cost_model.py#L73-L76
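To illustrate the fitting procedure described above, here is a minimal sketch. The data points, the matmul shape, and the single-constant model `time = flops / mm_flops` are all simplified assumptions for illustration; the actual FlexGen cost model fits several constants jointly over profiled runs. Fitting in log space keeps the loss convex in the parameter, so plain gradient descent converges reliably:

```python
import math

# Hypothetical (batch, seq_len, hidden) -> measured matmul time in seconds.
# These numbers are made up for illustration; real data would come from
# profiling runs on the target GPU (A6000, A100, ...).
samples = [
    (4, 512, 4096, 0.0023),
    (8, 512, 4096, 0.0046),
    (4, 1024, 4096, 0.0046),
    (8, 1024, 4096, 0.0092),
]

def matmul_flops(b, s, h):
    # FLOPs of a (b*s, h) x (h, h) matmul: 2 * m * k * n.
    return 2.0 * b * s * h * h

# Assumed model: time = flops / mm_flops. Fit x = log(mm_flops) by
# gradient descent on the squared error in log space (convex in x).
x = math.log(1e12)  # initial guess: 1 TFLOP/s effective throughput
lr = 0.1
for _ in range(500):
    grad = 0.0
    for b, s, h, t in samples:
        # residual of log(predicted time) vs. log(measured time)
        resid = math.log(matmul_flops(b, s, h)) - x - math.log(t)
        grad += -2.0 * resid
    x -= lr * grad / len(samples)

mm_flops = math.exp(x)
print(f"fitted mm_flops = {mm_flops:.3e} FLOP/s")
```

With measurements collected on the target GPU in place of the fake samples, the fitted `mm_flops` reflects that GPU's *effective* matmul throughput (typically well below the datasheet peak), which is why the constants are per-hardware magic numbers rather than something derived from specs.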
Hello! Thank you for sharing your great work!
I have a question. I want to compute the constants in cost_model.py for other GPUs (e.g., A6000, A100). These GPUs have different FLOPS and memory bandwidth, but in cost_model.py, constants such as
*mm_flops*
are just magic numbers, and it seems they don't account for GPU memory bandwidth. Is there a method for calculating
*mm_flops*
? Thank you.