feifeibear / LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

How to use Tensor Core to accelerate Speculative Sampling? #25

Closed · zhaoyang-star closed this 1 month ago

zhaoyang-star commented 7 months ago

When using the KV cache, the input ids shape becomes [1, seq_len], where seq_len is 1 during autoregressive decoding: only the newly generated token is fed to the model at each step.

In this case, only CUDA cores, not Tensor Cores, can be used to accelerate the decoding kernel for draft model inference.
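
As an illustration, here is a minimal PyTorch sketch (not from this repo; `d_model` and the prefill length are hypothetical) contrasting the prefill GEMM with the single-token decode step:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

d_model = 4096  # hypothetical hidden size
W = torch.randn(d_model, d_model, dtype=dtype, device=device)

# Prefill: the whole prompt is processed at once, so the GEMM has a
# large M dimension and can fill Tensor Core MMA tiles.
x_prefill = torch.randn(512, d_model, dtype=dtype, device=device)
y_prefill = x_prefill @ W  # [512, d] @ [d, d]

# Decode with a KV cache: only the newest token is fed in, so every
# projection degenerates to a [1, d] @ [d, d] matrix-vector product.
# With M = 1, a Tensor Core MMA tile (e.g. 16x16x16) cannot be filled.
x_decode = torch.randn(1, d_model, dtype=dtype, device=device)
y_decode = x_decode @ W
```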

Since Tensor Cores are much faster, is there any way to make seq_len larger so that Tensor Cores can be used?

feifeibear commented 4 months ago

Sorry for my late response. Decoding is a notoriously memory-bound problem, so using Tensor Cores is sometimes a waste.
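
To make the memory-bound argument concrete, here is a rough back-of-envelope sketch. The `d_model` value is hypothetical; the A100 figures (roughly 312 TFLOPS of dense FP16 Tensor Core throughput and about 2 TB/s of HBM bandwidth) are published peak numbers:

```python
# Arithmetic intensity of one decode-step projection, assuming fp16
# weights and a hypothetical d_model of 4096.
d = 4096
flops = 2 * d * d                # [1, d] @ [d, d]: d*d multiply-adds
bytes_moved = 2 * d * d          # streaming the fp16 weight matrix once
intensity = flops / bytes_moved  # ~1 FLOP per byte

# An A100 delivers ~312 TFLOPS of dense fp16 Tensor Core math but only
# ~2 TB/s of HBM bandwidth, so it needs ~156 FLOPs per byte of traffic
# to be compute-bound.
ridge_point = 312e12 / 2e12

print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"compute-bound above:  {ridge_point:.0f} FLOP/byte")
```

At roughly 1 FLOP per byte, a decode-step matvec sits far below the ~156 FLOP/byte ridge point, so the kernel spends its time waiting on memory and the Tensor Cores would mostly sit idle no matter how the kernel is written.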

See this question on zhihu.com (in Chinese): https://www.zhihu.com/question/636533414/answer/3345355574