Closed tszdanger closed 10 months ago
Hi @tszdanger
Thanks for your attention. I must admit that my implementation of this project is quite rough; it was meant only to see what the output of Speculative Sampling looks like. There are some obviously inefficient parts, such as not using a KV cache to reduce computation in the forward passes of the small model, which makes the relative time costs between autoregressive sampling and Speculative Sampling unrealistic.
I have quickly added some KV cache optimizations. However, it's worth noting that the KV cache of the target model still requires careful design.
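To illustrate why the draft model's loop benefits from a KV cache, here is a pure-Python toy (not the repository's code; `toy_step` and its fake "attention" scores are illustrative assumptions). Without a cache every step would re-encode the whole prefix; with one, each step only processes the newest token against the stored keys/values.

```python
# Toy illustration of KV-cache reuse in an autoregressive draft model.
# Without a cache, every step re-encodes the whole prefix (O(n^2) total work);
# with a cache, each step only processes the newest token against stored K/V.

def toy_step(token, kv_cache):
    """One decoding step: 'attend' the new token over all cached keys/values.
    Returns a fake logit (sum of scores) and the updated cache."""
    k, v = float(token), float(token) * 0.5   # stand-ins for real projections
    score = sum(ck * k + cv * v for ck, cv in kv_cache)  # work ~ len(cache)
    kv_cache.append((k, v))                   # cache grows by one entry per step
    return score, kv_cache

def generate_with_cache(prompt, n_new):
    cache = []
    # Prefill: process the prompt once, populating the cache.
    for t in prompt:
        _, cache = toy_step(t, cache)
    out, last = [], prompt[-1]
    for _ in range(n_new):
        score, cache = toy_step(last, cache)
        last = int(score) % 10        # fake greedy "sampling" over 10 tokens
        out.append(last)
    return out, cache

tokens, cache = generate_with_cache([1, 2, 3], n_new=4)
print(tokens, len(cache))  # cache holds one (k, v) pair per processed token
```

In the real implementation the same idea is carried by the model's `past_key_values`: pass them back in on each step so only the new token is run through the network.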
Actually, we should also implement a KV cache for the forward passes of the target model. If rejection sampling occurs, we need to roll back the KV cache. I haven't implemented this feature yet. It's quite intriguing and worth studying in conjunction with batching.
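To make the rollback concrete, a minimal sketch (pure-Python toy, not the repository's code; the layout of one `(keys, values)` pair per layer mirrors the usual `past_key_values` shape). If the target model rejects the draft partway through, every cached entry past the accepted prefix must be discarded before the next verification round:

```python
# Toy KV-cache rollback after rejection sampling: truncate each layer's
# cached keys/values back to the accepted prefix length.

def rollback_kv_cache(kv_cache, prefix_len):
    """Keep only the first `prefix_len` cached positions in every layer."""
    return [(keys[:prefix_len], values[:prefix_len]) for keys, values in kv_cache]

# Example: a 2-layer cache built over 8 tokens (5 prompt + 3 speculated).
cache = [(list(range(8)), list(range(8))) for _ in range(2)]

# Suppose the target model accepted only the first speculated token,
# so 6 positions (prompt + 1 accepted token) survive.
cache = rollback_kv_cache(cache, prefix_len=6)
print([len(keys) for keys, _ in cache])  # [6, 6]
```

With real tensors the same idea is a slice along the sequence dimension, e.g. `k[..., :prefix_len, :]`; batching makes this trickier because each sequence in the batch may accept a different number of draft tokens.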
Hi @feifeibear
Thanks for your kind reply. I couldn't agree more with your suggestion. As I understand it, the key speedup comes from verifying the draft tokens in parallel with a single target-model forward pass (and the KV cache is certainly one of the most important optimizations supporting that process).
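For reference, the acceptance rule that makes this parallel verification exact can be sketched in a few lines (a toy over a 3-token vocabulary with made-up draft/target distributions `q` and `p`; not the repository's code, and the "bonus" target-model token appended after a fully accepted draft is omitted for brevity):

```python
import random

def speculative_accept(draft_tokens, q, p, rng):
    """Accept draft token x at position i with prob min(1, p[i][x] / q[i][x]);
    on the first rejection, resample from the residual max(0, p - q),
    normalized, and discard the remaining draft tokens."""
    out = []
    for i, x in enumerate(draft_tokens):
        if rng.random() < min(1.0, p[i][x] / q[i][x]):
            out.append(x)                      # token verified, keep it
        else:
            residual = [max(0.0, pj - qj) for pj, qj in zip(p[i], q[i])]
            z = sum(residual)
            out.append(rng.choices(range(len(residual)),
                                   weights=[r / z for r in residual])[0])
            break                              # rest of the draft is discarded
    return out

rng = random.Random(0)
q = [[0.7, 0.2, 0.1]] * 3   # draft distributions for 3 speculated positions
p = [[0.5, 0.4, 0.1]] * 3   # target distributions at the same positions
print(speculative_accept([0, 1, 0], q, p, rng))
```

This rule is what lets the target model check several draft tokens in one forward pass while still sampling from exactly the target distribution.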
I'll close this issue now that we've discussed the topic thoroughly.
Hi jiarui,
Thanks for your implementation, the code is neat and beautiful!
However, I have one question. As I understand it, Google's speculative sampling should not be slower than running autoregressive sampling with the original target model. Yet if we use small models such as pythia-70m and pythia-4b (https://huggingface.co/EleutherAI/pythia-70m), it turns out that the time cost of speculative sampling is much higher.
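One way to make the comparison fair is to time both decoders under identical conditions with a warm-up run excluded; a minimal harness (the two lambdas are hypothetical stand-ins for the autoregressive and speculative decoding loops):

```python
import time

def benchmark(decode_fn, n_runs=5):
    """Average wall-clock seconds per call of a decoding function."""
    decode_fn()                      # warm-up (excluded from timing)
    start = time.perf_counter()
    for _ in range(n_runs):
        decode_fn()
    return (time.perf_counter() - start) / n_runs

# Stand-ins: real usage would wrap the two sampling loops on the same prompt.
autoregressive = lambda: sum(i * i for i in range(20000))
speculative = lambda: sum(i * i for i in range(2000))

t_auto, t_spec = benchmark(autoregressive), benchmark(speculative)
print(f"autoregressive: {t_auto:.2e}s, speculative: {t_spec:.2e}s")
```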
I would appreciate your thoughts and suggestions on this matter.
Yours, Zonkey