-
Reproducing steps:
1. Clone the vllm repo and switch to [tag v0.3.1](https://github.com/vllm-project/vllm/tree/v0.3.1)
2. Build the image from Dockerfile.rocm with instructions from [Option 3: Bui…
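For anyone scripting the reproduction, here is a minimal sketch of steps 1–2 (the `vllm-rocm` image tag is a placeholder of mine, not from the original report):

```python
import subprocess

# Step 1: clone the vLLM repo directly at tag v0.3.1.
subprocess.run(
    ["git", "clone", "--branch", "v0.3.1", "--depth", "1",
     "https://github.com/vllm-project/vllm.git"],
    check=True,
)

# Step 2: build the ROCm image from Dockerfile.rocm
# ("vllm-rocm" is an arbitrary placeholder tag).
subprocess.run(
    ["docker", "build", "-f", "Dockerfile.rocm", "-t", "vllm-rocm", "."],
    cwd="vllm",
    check=True,
)
```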
-
Hi, I am running and profiling the code of the Mixtral implementation; however, neither in the code nor in the profiling trace did I find any AllToAll operations.
I built the TRT engine using the follo…
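If the profiling goes through PyTorch, a check along these lines can confirm whether any AllToAll collective was recorded; the linear layer below is just a stand-in for the real Mixtral forward pass:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload; substitute the actual Mixtral forward pass here.
model = torch.nn.Linear(64, 64)
inputs = torch.randn(8, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(inputs)

# Scan the recorded ops for any all-to-all collective.
hits = [e.key for e in prof.key_averages() if "all_to_all" in e.key.lower()]
print(hits or "no AllToAll ops recorded")
```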
-
Hi,
I remember that vLLM support was on your TODO list. Have you achieved it now? Was the main challenge in this direction that tree verification with batch size > 1 is hard to make efficient? Thanks…
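For context, here is a toy illustration (entirely my own, not from this project) of why batching tree verification is awkward: each sequence's speculation tree induces its own attention mask, so a batch of different tree shapes cannot share one causal mask.

```python
import torch

def tree_attention_mask(parents):
    # parents[i] is the parent of draft token i (-1 marks the root).
    # Token i may attend to itself and all of its ancestors.
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Two sequences in a batch with different tree shapes -> different masks.
print(tree_attention_mask([-1, 0, 0, 1]).int())  # root, two children, one grandchild
print(tree_attention_mask([-1, 0, 1, 2]).int())  # a plain chain
```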
-
Currently Intel offers the A770 for $300 with 16 GB of RAM and much better FLOPS than a 4060 Ti ($500).
From what I hear over at tinygrad, Intel's drivers are much better than AMD's.
We should support Int…
-
I deployed WizardLM-70B, a fine-tuned variant of Llama-2-70B, on 4 A100s (80 GB) using the vLLM worker. I noticed much slower responses (more than a minute even for a simple prompt like "Hi") at a thro…
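For what it's worth, a bare-engine sketch like the one below can help separate engine latency from serving overhead (the model id is a placeholder; substitute the actual checkpoint path):

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model id; swap in the real WizardLM checkpoint.
llm = LLM(model="WizardLM/WizardLM-70B-V1.0", tensor_parallel_size=4)

start = time.time()
outputs = llm.generate(["Hi"], SamplingParams(max_tokens=32))
print(f"{time.time() - start:.1f}s:", outputs[0].outputs[0].text)
```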
-
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the size of the engine I…
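Rough arithmetic (illustrative only, not measured) suggests the quantized weights themselves should fit comfortably with 2-way tensor parallelism, leaving the KV cache and activations as the tight part:

```python
# Back-of-the-envelope budget for Baichuan2-7B on 2x 16 GB V100s.
# INT8 weight-only quantization stores ~1 byte per parameter.
params = 7e9
weights_gb = params / 1e9                # ~7 GB total for INT8 weights
per_gpu_gb = weights_gb / 2              # ~3.5 GB per V100 with 2-way TP
headroom_gb = 16 - per_gpu_gb            # left for KV cache, activations, runtime
print(f"per-GPU weights ~{per_gpu_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```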
-
**Describe the bug**
When the provided example script is configured to use pipeline parallelism, two different behaviours are observed.
1. When tensor parallelism (tp) = 1 and pipeline parallelism (…
-
Hey vLLM team,
Hope you're all doing great! I'm focusing on pipeline-parallel inference, and I hope it can be supported in vLLM.
I noticed that pipeline parallelism was on the old roadmap (#244), b…
-
The example should show tensor parallelism. I am not sure if Serve + vLLM + tensor parallelism works at the moment because the Serve deployment will request N GPUs, then each vLLM worker will request …
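To make the concern concrete, here is a sketch of the pattern in question (the deployment name, model id, and GPU counts are all mine, purely for illustration): the Serve replica reserves GPUs up front, and vLLM's Ray workers then try to reserve GPUs again for tensor parallelism.

```python
from ray import serve
from vllm import LLM

# Illustrative only: the replica reserves 2 GPUs here...
@serve.deployment(ray_actor_options={"num_gpus": 2})
class VLLMDeployment:
    def __init__(self):
        # ...and with tensor_parallel_size=2, vLLM spawns Ray workers
        # that try to reserve GPUs of their own, on top of the
        # replica's reservation: the double request described above.
        self.llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.llm.generate([prompt])[0].outputs[0].text

app = VLLMDeployment.bind()
```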
-
## 🚀 Description
Pipeline parallelism is a technique used in deep learning to improve efficiency and reduce the training time of large neural networks by splitting the model into sequential stages that run on different devices. Here we propose a pipeline paral…
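As a toy sketch of the basic idea (stage sizes and boundaries here are arbitrary, and a real implementation would place each stage on its own GPU or host):

```python
import torch
import torch.nn as nn

# Two pipeline stages; in a real setup each would live on its own device.
stage0 = nn.Linear(16, 32)
stage1 = nn.Linear(32, 4)

def pipeline_forward(batch, n_microbatches=4):
    # Split the batch into micro-batches so that, with real devices,
    # stage1 can process micro-batch k while stage0 handles k+1.
    outs = [stage1(stage0(mb)) for mb in batch.chunk(n_microbatches)]
    return torch.cat(outs)

print(pipeline_forward(torch.randn(8, 16)).shape)  # torch.Size([8, 4])
```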