InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Optimize sampling on pytorch engine. #1853

Closed · grimoire closed this 2 months ago

grimoire commented 3 months ago

Sorting logits over a large vocabulary takes a long time on GPU. This PR gathers the top-k logits first and slices the scores after applying top-p, so only the k candidates are sorted and filtered instead of the full vocabulary.
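A minimal sketch of this gather-then-filter idea (not the PR's actual code; the function name, shapes, and default `top_k`/`top_p` values are illustrative):

```python
import torch

def sample_top_k_top_p(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Gather top-k first, then apply top-p on the small candidate set.

    logits: [batch, vocab_size] -> sampled token ids: [batch]
    """
    # topk avoids a full-vocabulary sort; values come back sorted descending
    topk_logits, topk_ids = logits.topk(top_k, dim=-1)   # [batch, k]

    probs = topk_logits.softmax(dim=-1)
    cum_probs = probs.cumsum(dim=-1)

    # drop candidates outside the top-p nucleus; the most likely token is
    # always kept because its exclusive cumulative probability is 0
    probs = probs.masked_fill(cum_probs - probs > top_p, 0.0)

    # sample among the k candidates, then map back to vocabulary ids
    sampled = torch.multinomial(probs, num_samples=1)    # [batch, 1]
    return topk_ids.gather(-1, sampled).squeeze(-1)
```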

llama2-13b (vocab_size=32000):

before

token throughput (completion token): 1462.612 token/s
token throughput (prompt + completion token): 3031.211 token/s
RPS (request per second): 6.131 req/s
RPM (request per minute): 367.831 req/min

after

token throughput (completion token): 1482.363 token/s
token throughput (prompt + completion token): 3072.144 token/s
RPS (request per second): 6.213 req/s
RPM (request per minute): 372.799 req/min

This roughly halves the sampling time. Models with larger vocabularies should benefit even more: GPU radix sort runs in O(nk) (n elements, k key bits), so the cost of sorting the full vocabulary grows with vocab_size, while the top-k path touches only k candidates.
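To see the gap between a full-vocabulary sort and a top-k gather, a rough microbenchmark one could run (assumes a CUDA device; the batch size, k, and iteration count are arbitrary):

```python
import time
import torch

logits = torch.randn(64, 32000, device="cuda")  # llama2-13b vocab size

def timeit(fn, iters=100):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return time.perf_counter() - t0

t_sort = timeit(lambda: logits.sort(dim=-1, descending=True))  # full-vocab sort
t_topk = timeit(lambda: logits.topk(50, dim=-1))               # k candidates only
print(f"full sort: {t_sort:.3f}s  topk: {t_topk:.3f}s")
```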