Question on original DistServe paper - communication overhead

LLMServe / DistServe

Disaggregated serving system for Large Language Models (LLMs).

Apache License 2.0


lxldavid91 closed this issue 1 month ago

lxldavid91 commented 2 months ago

Hi, first of all thanks for the great work.

I have been diving deep into your paper and have the following two questions:

1. How was the 90 Gbps figure calculated? Does it come from a real PoC test or from a projection?
2. Are you overlapping KV cache transmission with prefill computation to hide the latency? In this plot, it seems that the overhead is not hidden. Also, do you transmit the KV cache layer by layer, or only after the entire prefill computation is done?

Looking forward to your reply:) Thank you in advance.

hyuenmin-choi commented 2 months ago

Hey, I've also been digging deep into this work.

I think 11.3 GB = 90.4 Gb, so they wrote it as 90 Gbps in the paper.
As for the second question, the analysis in Splitwise [ISCA'24], which has a similar design, suggests that when the KV cache stays below a certain length (512 tokens, if I remember correctly), sending it in one chunk is better for transmission latency, while a larger KV cache is transmitted layer by layer. Of course, overlapping transmission with computation helps hide latency, but DistServe doesn't seem to cover this specifically. If you want a more detailed treatment of RDMA-based KV cache transmission, I recommend reading Splitwise.
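
As a quick check of the unit conversion discussed in this thread, here is a minimal sketch. It assumes decimal units (1 GB = 8 Gb) and that the 11.3 GB of KV cache has to be moved within one second; the function name `required_gbps` is hypothetical and not part of DistServe.

```python
def required_gbps(kv_cache_gb: float, window_s: float = 1.0) -> float:
    """Bandwidth in Gbps needed to move `kv_cache_gb` gigabytes in `window_s` seconds.

    Assumes decimal units, i.e. 1 gigabyte = 8 gigabits.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return kv_cache_gb * 8 / window_s


if __name__ == "__main__":
    # 11.3 GB in one second, as in the thread; the paper rounds the result to 90 Gbps.
    print(f"{required_gbps(11.3):.1f} Gbps")
```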

interestingLSY commented 1 month ago

90 Gbps = 11.3GB * 8 / 1s
If we focus on a single request, we are not hiding the KV cache transmission latency (it is on the critical path). This is acceptable because the transmission overhead is negligible (as the bar chart above shows). Meanwhile, DistServe can still run the prefill and decoding phases of other requests while one request's KV cache is being transmitted.
In the current design of DistServe, we transmit the KV cache only after the prefill computation has finished entirely.
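
To make the two transmission strategies discussed in this thread concrete, here is a toy latency model for a single request. It is only a sketch under stated assumptions, not DistServe or Splitwise code: every layer is assumed to take the same compute time and the same transfer time, the link carries one layer at a time, and the function names and numbers are hypothetical.

```python
def after_prefill_latency(n_layers: int, compute_per_layer: int, transfer_per_layer: int) -> int:
    """Transmit the whole KV cache only after prefill finishes (the behaviour described above)."""
    return n_layers * compute_per_layer + n_layers * transfer_per_layer


def layerwise_latency(n_layers: int, compute_per_layer: int, transfer_per_layer: int) -> int:
    """Transmit each layer's KV cache as soon as that layer is computed (layer-by-layer overlap)."""
    transfer_done = 0
    for layer in range(n_layers):
        compute_done = (layer + 1) * compute_per_layer
        # A layer's transfer starts once its compute is done and the link is free.
        transfer_done = max(compute_done, transfer_done) + transfer_per_layer
    return transfer_done


if __name__ == "__main__":
    # Hypothetical numbers in arbitrary time units, chosen only for illustration.
    print(after_prefill_latency(4, 10, 2))
    print(layerwise_latency(4, 10, 2))
```

In this model, when the transfer time is small relative to compute, the two strategies differ by little, which is consistent with the point above that the overhead is negligible on the critical path.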