Open ZHENG518 opened 5 months ago
Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.
Can it support quantized deployment, e.g. GPTQ or AWQ?
Hi, we have added vLLM support in this PR: https://github.com/vllm-project/vllm/pull/4650
Thank you for your great work. According to your documentation, actual deployment on an 8*H800 machine reaches an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we achieve such performance with this vLLM support?
Hi, thank you for your great work! I would like to know how much VRAM is needed. I tried with 8*40G but failed with OOM.
8x80G; 8*40G only works for the 4-bit model.
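For context, a rough back-of-envelope estimate of the weight memory (assuming DeepSeek-V2's published size of roughly 236B parameters; KV cache and activations add further overhead, so these are lower bounds):

```python
# Rough VRAM estimate for serving a large model (weights only).
# 236e9 parameters is the figure from the DeepSeek-V2 model card;
# KV cache and activation memory come on top of this.

def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Gigabytes needed to hold the weights at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

n = 236e9
fp16 = weight_gb(n, 16)  # ~472 GB: exceeds 8*40 GB = 320 GB, fits in 8*80 GB = 640 GB
int4 = weight_gb(n, 4)   # ~118 GB: fits in 8*40 GB = 320 GB

print(f"fp16 weights: {fp16:.0f} GB, int4 weights: {int4:.0f} GB")
```

This matches the advice above: fp16 weights alone overflow 8*40G, while a 4-bit quantized model fits.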
> Hi, thank you for your great work! I would like to know how much VRAM is needed. I tried with 8*40G but failed with OOM.
>
> 8x80G; 8*40G only works for the 4-bit model.

Got it, thank you~
> Hi, thank you for your great work! I would like to know how much VRAM is needed. I tried with 8*40G but failed with OOM.
>
> 8x80G; 8*40G only works for the 4-bit model.

4-bit model? We don't get it.
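A "4-bit model" here means one whose weights have been quantized, e.g. with GPTQ or AWQ as asked earlier in the thread, so each parameter takes 4 bits instead of 16. A minimal illustration of why that quarters the weight memory (plain NumPy round-to-nearest with nibble packing; this is only a storage sketch, not the actual GPTQ/AWQ algorithm):

```python
import numpy as np

np.random.seed(0)
# Eight example weights; real layers have millions, but the ratio is the same.
w = np.random.randn(8).astype(np.float32)

# Symmetric 4-bit quantization: map each weight to an integer in [-7, 7].
scale = np.abs(w).max() / 7
q = np.round(w / scale).astype(np.int8)

# Pack two 4-bit values into each byte: 8 weights -> 4 bytes
# (vs. 16 bytes at fp16), a 4x reduction in weight memory.
nibbles = (q & 0x0F).astype(np.uint8)
packed = nibbles[0::2] | (nibbles[1::2] << 4)

print(f"fp16 storage: {8 * 2} bytes, packed int4: {packed.nbytes} bytes")
```

The quantization error per weight is at most half the scale step, which is why 4-bit serving is usually close to full-precision quality while fitting in far less VRAM.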
> Hi, we have added vLLM support in this PR: https://github.com/vllm-project/vllm/pull/4650

I failed to run inference with vLLM 0.4.2 and got the following error:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
    llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'
```
Same problem
Solved by checking the engine/arg_utils.py file.
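For readers hitting the same error: this `TypeError` is the generic Python error for calling a function without a required positional argument. Here it likely means the installed vllm package mixed files from two versions, so the call site in `engine/arg_utils.py` is out of sync with `SpeculativeConfig`'s newer signature; a clean reinstall of a single vLLM version avoids it. A minimal stand-alone reproduction of the error pattern (the function below is a simplified stand-in, not vLLM's actual code):

```python
# Simplified stand-in for the newer SpeculativeConfig.maybe_create_spec_config
# signature, which requires the extra speculative_disable_by_batch_size argument.
def maybe_create_spec_config(target_model_config,
                             speculative_disable_by_batch_size):
    return None

try:
    # Older call site passing only one argument, as a stale arg_utils.py would.
    maybe_create_spec_config("model-config")
except TypeError as e:
    print(e)  # missing 1 required positional argument: 'speculative_disable_by_batch_size'
```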
@BasicCoder did you achieve such speed?