James-QiuHaoran / LLM-serving-with-proxy-models

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Evaluation method for the scheduling side - trace driven simulation or real world? #1

Closed saeid93 closed 5 months ago

saeid93 commented 5 months ago

Dear Authors,

Thank you for sharing the source code. It is a very interesting research topic and I'm looking forward to seeing your future work on the subject.

I have a question regarding the method used to evaluate the performance of the SSJF algorithm in terms of throughput and latency. In the evaluation section of the paper (Section 4), it is mentioned that: "We deploy SSJF on an IBM Cloud gx2-16x128x2v100 instance with 2 NVIDIA Tesla V100 (16GB) GPUs". Looking at the code, I am a bit puzzled: was the scheduling performance evaluation a trace-driven simulation, as I gather from the code under the model-serving folder, or was it deployed on a real-world cluster with that part of the code not public? If the results in the paper come from the trace-driven simulation in this repo and there is no other code for deployment on the GPUs, then does mentioning "using two V100s" in the evaluation section mean that the traces used in the simulation under prediction/final were profiled on V100 GPUs?

James-QiuHaoran commented 5 months ago

Hi @saeid93, the released code contains the scheduler simulation, which supports uniform, Poisson, and gamma arrival distributions with configurable parameters. Trace-driven simulation is on the way (stay tuned!); the traces we are looking at are the Azure LLM inference traces.
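
For context, here is a minimal sketch of how request inter-arrival times can be drawn under those three distributions (using numpy; the function and parameter names are illustrative, not the actual config options in this repo):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def interarrival_times(dist, n, rate=1.0, shape=2.0):
    """Draw n request inter-arrival times (seconds) for a given distribution.

    dist  -- "uniform", "poisson", or "gamma"
    rate  -- mean arrival rate (requests/sec)
    shape -- gamma shape parameter (controls burstiness)
    """
    mean_gap = 1.0 / rate
    if dist == "uniform":
        # Uniform gaps around the mean inter-arrival time.
        return rng.uniform(0.0, 2.0 * mean_gap, n)
    elif dist == "poisson":
        # Poisson arrival process <=> exponential inter-arrival times.
        return rng.exponential(mean_gap, n)
    elif dist == "gamma":
        # Gamma gaps with the same mean; scale = mean / shape.
        return rng.gamma(shape, mean_gap / shape, n)
    raise ValueError(f"unknown distribution: {dist}")

# Cumulative sum of gaps gives absolute arrival timestamps.
arrivals = np.cumsum(interarrival_times("poisson", n=1000, rate=5.0))
```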

saeid93 commented 5 months ago

@James-QiuHaoran Awesome, looking forward to it!

Could you please also answer my question about whether the results in the paper (Figure 7) were extracted from the simulator, or whether a scheduler was deployed on a real-world cluster (not simulated)? My guess is that the scheduling side of the paper was done in simulation, but since you mentioned using two GPUs on your cluster, I'm a bit confused:

"Testbed. We deploy SSJF on an IBM Cloud gx2-16x128x2v100 instance with 2 NVIDIA Tesla V100 (16GB) GPUs. Each GPU supports a maximum Streaming Multiprocessor (SM) frequency of 1380 MHz and a minimum of 200 MHz."

Are the GPUs used in Section 4 only for the sequence length prediction part, with the scheduling side simulated using the simulator in this repo?

James-QiuHaoran commented 5 months ago

The GPUs are used for model deployment and token prediction (since we need to run the model to get the ground-truth output token sequence length for SJF), as well as to measure the per-token generation latency (deemed constant across prompts for a deployed model-serving instance), which is required to run the simulator.
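
In other words, the simulator treats each request's service time as (output token count) × (per-token latency). A minimal sketch of that idea for non-preemptive SJF (variable names illustrative, not the exact code in this repo):

```python
import heapq

PER_TOKEN_LATENCY = 0.03  # seconds/token, profiled once per serving instance

def sjf_schedule(requests):
    """Simulate non-preemptive SJF on a single serving instance.

    requests -- list of (arrival_time, output_tokens) tuples, where
                output_tokens is the ground-truth (or predicted) length.
    Returns per-request latency (completion time - arrival time).
    """
    requests = sorted(requests)        # order by arrival time
    ready, latencies = [], []
    clock, i = 0.0, 0
    while i < len(requests) or ready:
        # Admit every request that has arrived by the current time.
        while i < len(requests) and requests[i][0] <= clock:
            arrival, tokens = requests[i]
            heapq.heappush(ready, (tokens, arrival))  # shortest job first
            i += 1
        if not ready:                  # idle: jump to the next arrival
            clock = requests[i][0]
            continue
        tokens, arrival = heapq.heappop(ready)
        clock += tokens * PER_TOKEN_LATENCY  # constant per-token cost
        latencies.append(clock - arrival)
    return latencies
```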

saeid93 commented 5 months ago

Thank you!