zhouheyun opened this issue 2 months ago
Our open-source code (https://github.com/vllm-project/vllm/pull/4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper. @zhouheyun
What's the average inference context length to achieve the claimed throughput in the paper? @luofuli
32K context length @zhouheyun
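For anyone who wants a rough sanity check against that 32K figure, here is a minimal sketch of a long-context throughput measurement with vLLM's offline API. The model id, tensor-parallel degree, and prompt construction are illustrative assumptions, and this is not the serving stack behind the paper's numbers, so it will not match the reported throughput:

```python
import time
from vllm import LLM, SamplingParams

# Assumed settings for illustration only; the API platform's inference stack
# and hardware are not public, so this will not reproduce the paper's numbers.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed HF repo id
    trust_remote_code=True,
    tensor_parallel_size=8,           # assumed; depends on available GPUs
    max_model_len=32768,              # ~32K context, per the reply above
)

# Build long prompts by repetition; exact token counts vary with the tokenizer,
# roughly 25-30K tokens per prompt here.
filler = "The quick brown fox jumps over the lazy dog. "
prompts = [filler * 2500 for _ in range(32)]

params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} output tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```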
I have a few questions about the inference efficiency of DeepSeek-V2:
Are all storage and computation performed in FP8? Does this harm the model's performance?
Is this throughput achieved using test requests with a 128K context length? Can we reproduce it using https://github.com/vllm-project/vllm/pull/4650?
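On the FP8 question, one knob that can at least be probed with the open-source path is vLLM's FP8 KV-cache option; it does not tell us whether the API platform also stores weights and activations in FP8. A sketch, assuming a vLLM build that includes DeepSeek-V2 support from the PR above (the accepted `kv_cache_dtype` values and their hardware support vary by vLLM version and GPU):

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build with DeepSeek-V2 support (e.g. the PR linked above).
# kv_cache_dtype="fp8" only quantizes the KV cache; it says nothing about
# whether the API platform runs weights/activations in FP8.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed HF repo id
    trust_remote_code=True,
    tensor_parallel_size=8,           # assumed
    max_model_len=32768,
    kv_cache_dtype="fp8",             # availability depends on vLLM version and GPU
)

out = llm.generate(
    ["Explain multi-head latent attention in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(out[0].outputs[0].text)
```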