deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
MIT License

How to deploy in vLLM? #7

Open ZHENG518 opened 5 months ago

stack-heap-overflow commented 5 months ago

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The HuggingFace code is not as efficient as we would like, so we are developing new open-source code based on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.

Xu-Chen commented 5 months ago

> The HuggingFace code is not as efficient as we would like, so we are developing new open-source code based on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.

Can it support quantized deployment, e.g. GPTQ or AWQ?
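For context, vLLM exposes a quantization argument for pre-quantized checkpoints; whether a GPTQ/AWQ build of DeepSeek-V2 actually works through it is exactly what is being asked here, so the snippet below is hypothetical usage, not a confirmation:

```python
# Hypothetical: pointing vLLM at a pre-quantized checkpoint via its quantization flag.
# The repo id is a placeholder; no AWQ/GPTQ build is confirmed in this thread.
from vllm import LLM

llm = LLM(
    model="your-org/DeepSeek-V2-Chat-AWQ",  # placeholder checkpoint id
    quantization="awq",                     # vLLM also accepts "gptq"
    tensor_parallel_size=8,
    trust_remote_code=True,
)
```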

zwd003 commented 5 months ago

Hi, we have added vLLM support in this PR: https://github.com/vllm-project/vllm/pull/4650
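For anyone landing on this thread later, here is a minimal sketch of what loading the model through vLLM looks like once that PR is in; the model id, context length, and sampling settings are assumptions to adapt, not an official recipe:

```python
# Minimal sketch: serving DeepSeek-V2 with vLLM across 8 GPUs.
# Model id and max_model_len are assumptions; adjust to your checkpoint and hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",  # assumed HF repo id
    tensor_parallel_size=8,                # split weights over 8 GPUs
    trust_remote_code=True,
    max_model_len=8192,                    # assumed; raise if KV-cache memory allows
    enforce_eager=True,                    # same flag as used later in this thread
)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```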

BasicCoder commented 5 months ago

> Hi, we have added vLLM support in this PR: vllm-project/vllm#4650

Thank you for your great work. According to your documentation, the actual deployment on an 8*H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we reach that level of performance with this vLLM integration?
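The 100,000/50,000 tokens/s figures describe DeepSeek's own 8*H800 deployment, so stock vLLM numbers will differ; a rough way to measure your own throughput (batch size, prompt, and output length below are arbitrary assumptions) is to time a batch of generations:

```python
# Rough throughput check: time a batch of generations and divide tokens by wall time.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Chat",  # assumed repo id
          tensor_parallel_size=8, trust_remote_code=True)

prompts = ["Summarize the transformer architecture."] * 256   # arbitrary batch
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"output throughput: {out_tokens / elapsed:.0f} tokens/s over {elapsed:.1f} s")
```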

Ricardokevins commented 4 months ago

Hi, thank you for your great work! How much VRAM is needed? I tried with 8*40G but failed with OOM.

zwd003 commented 4 months ago

> How much VRAM is needed? I tried with 8*40G but failed with OOM.

8x80G; 8*40G only works for a 4-bit model.
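The arithmetic behind that answer, using the published 236B total parameter count and counting weights only (KV cache and activations come on top, so these are lower bounds):

```python
# Back-of-the-envelope weight memory for DeepSeek-V2 (236B total parameters).
params = 236e9

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")

# bf16  ~472 GB -> exceeds 8*40 GB = 320 GB, fits in 8*80 GB = 640 GB
# 4-bit ~118 GB -> fits in 8*40 GB with room left for the KV cache
```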

Ricardokevins commented 4 months ago

> 8x80G; 8*40G only works for a 4-bit model.

Got it, thank you~

ccp123456789 commented 4 months ago

> 8x80G; 8*40G only works for a 4-bit model.

4-bit model? We don't get it.

ZhangYaoFu commented 4 months ago

> Hi, we have added vLLM support in this PR: vllm-project/vllm#4650

I failed to run inference with vLLM 0.4.2 and got the following error:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Traceback (most recent call last):
      File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
        llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
      File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
        self.llm_engine = LLMEngine.from_engine_args(
      File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
        engine_config = engine_args.create_engine_config()
      File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
        speculative_config = SpeculativeConfig.maybe_create_spec_config(
    TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'

yukiwayx commented 4 months ago

> TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'

Same problem here.

Solved it by checking the engine/arg_utils.py file.
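One plausible cause, an assumption rather than something the posters confirmed, is a vLLM install where engine/arg_utils.py and the SpeculativeConfig it calls come from different versions (for example after an in-place patch or a partial upgrade); checking what is actually installed and loaded is a quick first step:

```python
# Sanity check: confirm the installed vLLM version and that arg_utils.py
# is loaded from that install rather than a stale or locally patched copy.
import vllm
import vllm.engine.arg_utils as arg_utils

print("vLLM version:", vllm.__version__)
print("arg_utils loaded from:", arg_utils.__file__)
```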

KylinMountain commented 1 month ago

@BasicCoder did you manage to achieve that speed?