feifeibear / LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

Parallel question #2

Closed tszdanger closed 10 months ago

tszdanger commented 10 months ago

Hi jiarui,

Thanks for your implementation, the code is neat and beautiful!

However, I've got one question. According to my understanding, Google's speculative sampling should not be slower than running autoregressive sampling with the original target model alone. In fact, if we use some tiny models such as pythia-70m and pythia-4b (https://huggingface.co/EleutherAI/pythia-70m), it turns out that the time cost of speculative sampling is much higher.

I would appreciate your thoughts and suggestions on this matter.

Yours, Zonkey

feifeibear commented 10 months ago

Hi @tszdanger

Thanks for your attention. I must admit that my implementation of this project is very rough, just to see what the output of Speculative Sampling looks like. There are some obviously inefficient parts, such as not using a KV Cache to reduce computation during the forward passes of the small model, which leads to unrealistic relative time costs between autoregressive and speculative sampling.
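For readers following along, here is a minimal sketch (not the repository's actual code) of what reusing the KV cache during the draft model's autoregressive proposals looks like with the HuggingFace `transformers` forward API; the names `draft_model`, `prefix`, and `gamma` are illustrative assumptions.

```python
import torch

@torch.no_grad()
def draft_with_kv_cache(draft_model, prefix: torch.Tensor, gamma: int):
    """Propose `gamma` tokens; after the first step, each forward pass
    feeds only the newest token and reuses the cached keys/values."""
    tokens = prefix
    past_key_values = None
    for _ in range(gamma):
        # First step consumes the whole prefix, later steps only the last token.
        inputs = tokens if past_key_values is None else tokens[:, -1:]
        out = draft_model(input_ids=inputs,
                          past_key_values=past_key_values,
                          use_cache=True)
        past_key_values = out.past_key_values
        probs = torch.softmax(out.logits[:, -1, :], dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens, past_key_values
```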

I quickly added some KVCache optimizations. However, it's worth noting that the KV Cache of the target model still requires some careful design.


feifeibear commented 10 months ago

Actually, we should implement a KV cache for the forward passes of the target model as well. If a rejection occurs during sampling, we need to roll back the KV cache. I haven't implemented this feature yet. It's quite intriguing and worth studying in conjunction with batching.
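A hedged sketch of the rollback idea, assuming the legacy HuggingFace cache layout (a tuple per layer of key/value tensors shaped `[batch, heads, seq_len, head_dim]`); `accepted_len` and the function name are illustrative, not the project's API:

```python
def rollback_kv_cache(past_key_values, accepted_len: int):
    """Keep only the cache entries for the accepted prefix by truncating
    every layer's key/value tensors to `accepted_len` positions."""
    return tuple(
        (k[:, :, :accepted_len, :], v[:, :, :accepted_len, :])
        for k, v in past_key_values
    )
```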

tszdanger commented 10 months ago

Hi @feifeibear

Thanks for your kind reply. I couldn't agree more with your suggestion. The key speedup here, from my understanding, comes from the target model verifying the draft tokens in parallel (and, for sure, the KV cache is one of the most important optimizations that assists this process).
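To make the parallelism explicit, here is a rough sketch, with assumed names `target_model`, `prefix_with_draft`, and `gamma`, of the single target-model forward pass that scores all draft tokens at once instead of running gamma sequential decoding steps:

```python
import torch

@torch.no_grad()
def verify_in_parallel(target_model, prefix_with_draft: torch.Tensor, gamma: int):
    """One forward over prefix + gamma draft tokens; returns the target-model
    probabilities at the gamma draft positions plus one bonus position."""
    out = target_model(input_ids=prefix_with_draft, use_cache=False)
    # The last (gamma + 1) positions cover the accept/reject tests on the
    # gamma draft tokens and the extra token sampled when a rejection occurs.
    return torch.softmax(out.logits[:, -(gamma + 1):, :], dim=-1)
```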

I think I will close this, as we have discussed this topic thoroughly.