Closed tdoublep closed 4 months ago
Motivation
Users are seeing runtime errors when running speculative decoding with tensor parallelism (TP > 1).
Modifications
We now set the tensor-parallel argument correctly when instantiating the PagedKVCacheManager, so that each rank's KV cache matches its shard of the model.
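To illustrate why the tensor-parallel size matters here, the sketch below models a paged KV cache whose per-rank head count depends on it. It is a minimal, self-contained sketch under stated assumptions, not the actual PagedKVCacheManager API: `CacheShape`, `per_rank_cache_shape`, and `check_attention_shard` are hypothetical names. The assumption is that attention heads are split evenly across ranks, so a cache built with the default tensor-parallel size of 1 on a TP=2 run holds twice as many KV heads as each rank's attention shard actually produces. The numbers (8 KV heads, head dimension 128) match Llama 3 8B and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheShape:
    """Per-rank shape of one paged KV-cache block (hypothetical layout)."""
    block_size: int
    kv_heads: int
    head_dim: int


def per_rank_cache_shape(total_kv_heads: int, head_dim: int, block_size: int,
                         tensor_parallel_size: int) -> CacheShape:
    # Assumption: heads are sharded evenly across TP ranks, so each rank
    # only stores KV entries for its own slice of the heads.
    if total_kv_heads % tensor_parallel_size != 0:
        raise ValueError("total_kv_heads must be divisible by tensor_parallel_size")
    return CacheShape(block_size, total_kv_heads // tensor_parallel_size, head_dim)


def check_attention_shard(cache: CacheShape, local_kv_heads: int) -> None:
    # The sharded attention layer emits keys/values for local_kv_heads only;
    # a cache sized for a different head count cannot accept them.
    if cache.kv_heads != local_kv_heads:
        raise RuntimeError(
            f"cache expects {cache.kv_heads} kv heads, "
            f"attention shard produced {local_kv_heads}"
        )


if __name__ == "__main__":
    tp = 2
    local_heads = 8 // tp  # heads produced by each rank's attention shard

    good = per_rank_cache_shape(8, 128, 16, tensor_parallel_size=tp)
    check_attention_shard(good, local_heads)  # shapes agree

    bad = per_rank_cache_shape(8, 128, 16, tensor_parallel_size=1)
    try:
        check_attention_shard(bad, local_heads)
    except RuntimeError as err:
        print("mismatch:", err)
```

In this toy model, forwarding the real tensor-parallel size when the cache is constructed is enough to make the per-rank cache agree with the sharded attention output.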
Result
I have verified that this change resolves the reported issue.
Related Issues
https://huggingface.co/ibm-fms/llama3-8b-accelerator/discussions/1