Mutinifni / splitwise-sim

LLM serving cluster simulator
MIT License

Question regarding the performance model #2

Open hdliu21 opened 2 days ago

hdliu21 commented 2 days ago

Thanks for the great work. Regarding the performance model, linear predictors are used to predict the latency of the prompt and token phases. When I looked into the implementation of the token time predictor, I found that the batch_tokens variable is calculated as the number of tasks in the batch (https://github.com/Mutinifni/splitwise-sim/blob/8f99e7dc9b407f4ce2488d03dd44c0b8b946dab0/performance_model.py#L230), which is generally a small number, from 1 to a few dozen. However, the token time predictor is built from the prompt size, which ranges from 128 to 32768 (https://github.com/Mutinifni/splitwise-sim/blob/8f99e7dc9b407f4ce2488d03dd44c0b8b946dab0/performance_model.py#L117). There is therefore a mismatch between the range of the key used to build the predictor and the range of the key used at prediction time. Since token time can be closely tied to the KV cache size, I suspect we should also use the batched prompt size to predict the token time, similar to how it is done for prompt time prediction (https://github.com/Mutinifni/splitwise-sim/blob/8f99e7dc9b407f4ce2488d03dd44c0b8b946dab0/performance_model.py#L227). Could you confirm whether my understanding is correct? Thanks.
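
To make the mismatch I mean concrete, here is a rough sketch; the data, fitting method, and helper names are made up for illustration, not copied from performance_model.py:

```python
import numpy as np

# Predictor fitted on prompt sizes (roughly 128 to 32768 in the profiling range);
# the latencies here are made-up values for illustration.
prompt_sizes = np.array([128, 512, 2048, 8192, 32768], dtype=float)
iteration_times_ms = np.array([12.0, 14.0, 20.0, 45.0, 150.0])
slope, intercept = np.polyfit(prompt_sizes, iteration_times_ms, deg=1)

def predict_token_time(batch_tokens: float) -> float:
    """Linear prediction of iteration time from the batched token count."""
    return slope * batch_tokens + intercept

# For a decode batch, batch_tokens equals the number of tasks (1 to a few dozen),
# which is far below the 128-32768 range the predictor was fitted on.
print(predict_token_time(8))
```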

Mutinifni commented 1 day ago

The time depends on the number of new tokens being processed in the iteration. For the decode phase (that is, TokenTask), each request/task generates only 1 new token per iteration. This is why decode phases underutilize compute and benefit from larger batch sizes. PromptTasks process several new tokens together in each iteration (up to the prompt size).
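
Roughly, the per-iteration token counting works like this (a minimal sketch with hypothetical task classes, not the simulator's actual API):

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical task classes for illustration only.
@dataclass
class PromptTask:
    prompt_size: int  # all prompt tokens are processed together in one iteration

@dataclass
class TokenTask:
    pass  # decode: exactly one new token per request per iteration

def new_tokens_in_iteration(task: Union[PromptTask, TokenTask]) -> int:
    """Number of new tokens a task contributes to the current iteration."""
    return task.prompt_size if isinstance(task, PromptTask) else 1

# A mixed batch: one prompt task plus three decode tasks.
batch = [PromptTask(prompt_size=2048), TokenTask(), TokenTask(), TokenTask()]
batch_tokens = sum(new_tokens_in_iteration(t) for t in batch)
print(batch_tokens)  # 2048 + 1 + 1 + 1 = 2051
```

A decode-only batch therefore contributes only a handful of new tokens per iteration, which is why it underutilizes compute compared to a prompt batch of the same request count.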