deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

How can I reach the 50,000+ tokens/s throughput reported in the paper? #35

Open ly19970621 opened 4 months ago

ly19970621 commented 4 months ago

Hardware: H800 PCIe * 8. With vLLM I can reach at most 1,500 tokens/s at batch_size=1024. How can I reach the 50,000+ tokens/s reported in the paper?

haichuan1221 commented 4 months ago

Hi, does vLLM run successfully for you? Have you applied any quantization? Also, PCIe bandwidth is relatively low, so tensor parallelism over it may be slow; the H100s in the paper are most likely on an 8-GPU host connected via NVLink.

ly19970621 commented 4 months ago

Yes, I am running it with vLLM. Do I still need to apply quantization on top of that? If quantization is required, could you open-source the quantized model, or at least describe the quantization method, AWQ or GPTQ? As for the parallelism strategy, should inference use tensor parallelism or pipeline parallelism? Also, on an 8-GPU SXM (NVLink) A800 machine I likewise only get 1,500 tokens/s with vLLM, and the inter-GPU bandwidth there is 400 GB/s.
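(For context, the setup described above corresponds roughly to an offline vLLM run like the sketch below; the model name, sequence length, and sampling settings are assumptions for illustration, not taken from this thread.)

```python
from vllm import LLM, SamplingParams

# Illustrative 8-way tensor-parallel run; model name and settings are assumptions.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",
    tensor_parallel_size=8,
    trust_remote_code=True,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```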

luofuli commented 4 months ago

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average.
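The FP8 weights plus 6-bit KV-cache scheme described above is part of DeepSeek's internal deployment stack and is not available in vLLM. As a rough open-source approximation, a minimal sketch of loading an FP8 checkpoint with vLLM's built-in FP8 KV cache might look like this (the checkpoint name and tensor-parallel size are assumptions):

```python
from vllm import LLM

# Sketch only: vLLM's FP8 weights + FP8 KV cache are a coarser substitute for
# the FP8 + 6-bit KV-cache deployment described above; checkpoint name is an assumption.
llm = LLM(
    model="neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8",
    tensor_parallel_size=8,
    trust_remote_code=True,
    kv_cache_dtype="fp8",
)
```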

halexan commented 2 months ago

How to convert parameters to FP8? Any example?

fengyang95 commented 1 month ago

Maybe you can refer to this: https://hf-mirror.com/neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8
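For reference, FP8 checkpoints of this kind are typically produced with the AutoFP8 library; a minimal sketch with dynamic activation scales might look like the following (the model IDs and the use of AutoFP8 here are assumptions based on that model card, not an official DeepSeek recipe):

```python
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Minimal FP8 weight quantization sketch with AutoFP8 (assumed tooling).
# "dynamic" activation scales need no calibration data.
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
)

model = AutoFP8ForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",  # assumed source model
    quantize_config,
)
model.quantize([])  # empty list: no calibration samples needed for dynamic scales
model.save_quantized("DeepSeek-Coder-V2-Lite-Instruct-FP8")
```

The resulting directory can then be loaded directly by vLLM on GPUs with native FP8 support.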

halexan commented 1 month ago

DeepSeek-Coder-V2-Instruct-FP8 does not support the A100. However, an H100 is currently not available to me. Can anyone else test it?

Currently we don't support MoE FP8 models on Ampere. This is because vLLM uses Triton for its FusedMoE kernel, which doesn't support the FP8 Marlin mixed-precision gemm.

See: https://hf-mirror.com/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8/discussions/1
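As a quick sanity check before loading an FP8 checkpoint, you can verify whether the GPU has native FP8 support (compute capability 8.9 or higher, i.e. Ada or Hopper); this is a generic PyTorch check, not something from vLLM or this repo:

```python
import torch

# Native FP8 tensor cores require compute capability 8.9 (Ada) or 9.0 (Hopper).
# A100/A800 (Ampere) report 8.0, which is why the FP8 MoE kernels are unavailable there.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 9):
    print(f"Compute capability {major}.{minor}: native FP8 supported")
else:
    print(f"Compute capability {major}.{minor}: no native FP8 (e.g. A100/A800)")
```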