deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Reproduce inference benchmark mentioned in the paper #21

Open zhouheyun opened 2 months ago

zhouheyun commented 2 months ago

I have a few questions about the inference efficiency of DeepSeek-V2. The paper states:

> In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8.

Are all storage and computation performed in FP8? Does this harm the model's performance?
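For context, weight-only FP8 conversion with per-tensor scaling usually looks something like the sketch below. The function names (`quantize_fp8`, `dequantize_matmul`) are hypothetical, and the paper does not confirm this is DeepSeek's exact scheme:

```python
# Minimal sketch of weight-only FP8 (E4M3) quantization with per-tensor
# scaling; illustrative only, not DeepSeek's actual deployment code.
# Requires PyTorch >= 2.1 for float8 dtypes.
import torch

def quantize_fp8(weight: torch.Tensor):
    """Cast a weight tensor to FP8 E4M3 with a per-tensor scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = weight.abs().max().clamp(min=1e-12) / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_matmul(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor):
    """Upcast FP8 weights for the matmul; real FP8 kernels fuse this step."""
    return x @ (w_fp8.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096)
w_fp8, s = quantize_fp8(w)
x = torch.randn(2, 4096)
y = dequantize_matmul(x, w_fp8, s)
```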

> On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.

Is this throughput measured with test requests at 128K context length? Can we reproduce it using https://github.com/vllm-project/vllm/pull/4650?
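One rough way to measure generation throughput against that PR branch is vLLM's offline API, as sketched below. The batch size, prompt, and sampling settings are placeholders, not the paper's benchmark configuration:

```python
# Rough throughput measurement with vLLM's offline API; a sketch only, and
# it will not match the paper's numbers, which come from an internal engine.
# Assumes 8 GPUs and the branch from vllm-project/vllm#4650.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Explain mixture-of-experts models."] * 64  # placeholder batch

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {generated / elapsed:.1f} tokens/s")
```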

luofuli commented 2 months ago

Our open-source code (https://github.com/vllm-project/vllm/pull/4650) is not the inference code used in the API platform, so it cannot reach the throughput reported in the paper. @zhouheyun

zhouheyun commented 2 months ago

> Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot reach the throughput reported in the paper. @zhouheyun

What's the average inference context length used to achieve the throughput claimed in the paper? @luofuli

luofuli commented 1 month ago

32K context length @zhouheyun
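To approximate that setting when load testing, requests would need prompts of roughly 32K tokens. A minimal sketch of building such synthetic inputs follows; the filler-repetition approach and the `make_prompt` helper are assumptions, not the paper's actual workload:

```python
# Build synthetic ~32K-token prompts to mirror the context length cited
# above; the real benchmark workload is not published, so this is only an
# approximation for load testing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-V2", trust_remote_code=True
)

def make_prompt(target_tokens: int = 32_000) -> str:
    """Repeat filler text, then truncate to the target token count."""
    filler = "The quick brown fox jumps over the lazy dog. "
    text = filler * (target_tokens // 8)  # rough overshoot, trimmed below
    ids = tokenizer(text)["input_ids"][:target_tokens]
    return tokenizer.decode(ids, skip_special_tokens=True)

prompt = make_prompt()
print(len(tokenizer(prompt)["input_ids"]))  # ~32K tokens
```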