Closed grimoire closed 2 months ago
Sorting the logits of a large vocabulary takes a long time on GPU. This PR gathers the top-k logits first and slices the scores after top-p, so the full vocabulary is no longer sorted.
llama2-13b (vocab_size=32000):

before
token throughput (completion token): 1462.612 token/s
token throughput (prompt + completion token): 3031.211 token/s
RPS (request per second): 6.131 req/s
RPM (request per minute): 367.831 req/min

after
token throughput (completion token): 1482.363 token/s
token throughput (prompt + completion token): 3072.144 token/s
RPS (request per second): 6.213 req/s
RPM (request per minute): 372.799 req/min
Sampling time is cut in half. A larger vocab_size should see a bigger gain, since the cost of radix sort, O(nk), grows with the number of elements n being sorted.
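
To illustrate the idea (not the PR's actual GPU kernel or code), here is a minimal pure-Python sketch of "select top-k first, then apply top-p only to those k candidates". The function name `top_k_top_p_filter` and its parameters are hypothetical, and the sketch assumes that tokens outside the top-k are masked out before the softmax, so probabilities are normalized over the k candidates only.

```python
import heapq
import math


def top_k_top_p_filter(logits, top_k, top_p):
    """Hypothetical helper: return (token_ids, probs) kept after top-k, then top-p."""
    k = min(top_k, len(logits))

    # Pick only the k largest logits instead of sorting the whole vocabulary.
    # heapq.nlargest returns the pairs in descending order of logit.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    ids = [i for i, _ in top]
    vals = [v for _, v in top]

    # Numerically stable softmax over the k candidates only (assumption:
    # everything outside the top-k is treated as masked out).
    max_val = vals[0]
    exps = [math.exp(v - max_val) for v in vals]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    cutoff = k
    cumulative = 0.0
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break

    # Slice the scores after top-p and renormalize what is left.
    ids, probs = ids[:cutoff], probs[:cutoff]
    norm = sum(probs)
    return ids, [p / norm for p in probs]
```

With this shape, the expensive ordering step only touches k elements, so its cost no longer depends on vocab_size once the top-k candidates have been gathered.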