Sorry for my late response. Decoding is a notoriously memory-bound problem, so using Tensor Cores is sometimes a waste. See this question on zhihu.com (in Chinese): https://www.zhihu.com/question/636533414/answer/3345355574
When using KV cache, the input ids shape becomes `[1, seq_len]`, where:

- `seq_len` can only be 1 or 2 for the draft model
- `seq_len` can only be `gamma + 1` for the target model

In this case, we can only use CUDA Cores rather than Tensor Cores to accelerate the decoding kernel for draft model inference.
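
Here is a minimal sketch of the shapes involved in one speculative-decoding step with KV cache; the names `draft_input_ids`, `target_input_ids`, and the value of `gamma` are placeholders for illustration, not code from this repo:

```python
import torch

batch_size = 1
gamma = 4                      # number of tokens drafted per step (example value)
vocab_size = 32000

# Draft model: autoregressive, one token at a time, so with KV cache each
# forward pass only sees input_ids of shape [1, 1] (or [1, 2] in the step
# right after a correction).
draft_input_ids = torch.randint(0, vocab_size, (batch_size, 1))
print(draft_input_ids.shape)   # torch.Size([1, 1])

# Target model: verifies all gamma drafted tokens plus one more token in a
# single forward pass, so input_ids has shape [1, gamma + 1].
target_input_ids = torch.randint(0, vocab_size, (batch_size, gamma + 1))
print(target_input_ids.shape)  # torch.Size([1, 5])

# With seq_len this small, the GEMMs in each decoder layer degenerate into
# GEMV-like, memory-bound operations, so Tensor Cores (which want larger
# matrix tiles) bring little benefit over CUDA Cores.
```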
As Tensor Cores are much faster, is there any way to make `seq_len` larger so that Tensor Cores can be used?