-
### Feature Description
```
from llama_index.core.llms.vllm import VllmServer
from llama_index.core.llms import ChatMessage
llm = VllmServer(api_url="http://localhost:8000", max_new_tokens=8000, temp…
```
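A minimal sketch of what a full setup might look like, assuming the import path above resolves in the installed llama_index version and that VllmServer exposes the standard llama_index chat interface; the temperature value and prompt are illustrative:

```
from llama_index.core.llms.vllm import VllmServer
from llama_index.core.llms import ChatMessage

# Assumed: a vLLM API server is already running on localhost:8000.
llm = VllmServer(
    api_url="http://localhost:8000",
    max_new_tokens=8000,
    temperature=0.7,  # illustrative; the original snippet is truncated here
)

# Standard llama_index chat call with a single user message.
response = llm.chat([ChatMessage(role="user", content="Summarize what vLLM does in one sentence.")])
print(response)
```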
-
Solid idea and ingenious code implementation, great work!
Have you considered implementing KV compression on the KV cache in the vLLM framework?
-
Hello, nice work and very helpful! Does this support vllm for fast generation?
-
**Describe the bug**
I'm hitting an illegal memory access in https://github.com/vllm-project/vllm/pull/5917 when setting fuse_reduction=False in the fused GEMM+ReduceScatter kernel.
**To Reproduce…
-
Hello, in recent tests I benchmarked Llama-13b, 7b, and similar models on an A100, comparing vllm and distserve. When the SLO is met, distserve outperforms vllm. However, when testing codellama-34b with an input length of 8192, I found that TTFT is about 3x higher than vllm's. Is this expected? vllm uses tp2; distserve uses prefill tp2 and decode tp2.
-
**Describe the bug**
After changing the configuration in config.yaml and running 'ilab xxx --help', the defaults shown are not consistent with config.yaml. E.g., after changing the default serve model to mixtral, the help message st…
-
Is there any way to deploy a multimodal model like videochat2 with vllm? vllm does not seem to support embedding inputs at the moment.
-
While working on the addition of vLLM https://github.com/instructlab/instructlab/pull/1442, I tried adding func test to the e2e test since the runner has a CUDA GPU. Unfortunately, it does not have en…
-
I'm trying to implement a control vector in the vllm codebase for the mixtral model, but I was wondering where I should add the control vector in the layer. Should it be added before attention, the fully connecte…
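As a general point of reference (not specific to vLLM's internal code paths), control/steering vectors are usually added to the residual stream at the output of a whole decoder layer, after both the attention and MLP blocks, rather than inside the attention block itself. A rough PyTorch-style sketch of the idea, with all module names hypothetical:

```
def decoder_layer_with_control(hidden_states, layer, control_vector=None, scale=1.0):
    # hidden_states: torch.Tensor of shape (batch, seq_len, hidden_size)
    # Standard pre-norm decoder layer: self-attention and MLP/MoE blocks,
    # each wrapped in a residual connection.
    residual = hidden_states
    hidden_states = residual + layer.self_attn(layer.input_layernorm(hidden_states))

    residual = hidden_states
    hidden_states = residual + layer.mlp(layer.post_attention_layernorm(hidden_states))

    # Control vector added to the residual stream at the layer output,
    # broadcast over the batch and sequence dimensions.
    if control_vector is not None:
        hidden_states = hidden_states + scale * control_vector

    return hidden_states
```

In a Mixtral-style model the MLP block is the mixture of experts, but the control-vector addition would sit in the same place, on the summed residual output of the layer.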
-
I'm wondering if I can get an easier pipeline by loading the awq weights with vllm:
```
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the Uni…