-
### 🚀 The feature, motivation and pitch
Support tensor parallelism for QLoRA in vLLM.
### Alternatives
_No response_
### Additional context
_No response_
-
### Proposal to improve performance
Propose synchronizing the broadcast of `tensor_dict` at the beginning of each decoding step, or blocking the process after the broadcast.
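The proposed fix could be sketched with `torch.distributed` primitives: broadcast the dict from the driver rank, then call `barrier()` so every rank blocks until the broadcast has completed. This is an illustrative sketch (the function name, dict contents, and single-process gloo setup are assumptions for the demo), not vLLM's actual implementation:

```python
import os
import torch
import torch.distributed as dist

def broadcast_and_sync(tensor_dict, src=0):
    """Broadcast tensor_dict from the source rank, then block until
    every rank has received it (the synchronization proposed above)."""
    objs = [tensor_dict if dist.get_rank() == src else None]
    dist.broadcast_object_list(objs, src=src)
    dist.barrier()  # block the process after the broadcast
    return objs[0]

# Single-process demo with the gloo backend; a real deployment would
# launch one process per tensor-parallel rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)
received = broadcast_and_sync({"step": torch.tensor([3])})
dist.destroy_process_group()
```

The `barrier()` after the broadcast is what prevents a fast rank from racing ahead into the next decoding step before slower ranks have consumed the metadata.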
### Report of performance regr…
-
### Your current environment
- `vllm==0.5.3.post1`
- `python=3.9`
### 🐛 Describe the bug
When using `distributed_executor_backend=mp` with `vllm==0.5.3.post1`, the process doe…
-
### System Info
```shell
In `run_generation.py` (for text generation), how can I know which kind of parallelism is being used, data or tensor? And is there a way to switch between the two?
``…
-
# 🚀 Feature request
Splitting the discussion that started here: https://github.com/huggingface/transformers/pull/10301#issuecomment-782917393 to add the potential future feature of transformers and…
-
Hi,
I'm trying to get an example working with Ray on Databricks, essentially having multiple replicas of the model. Is it possible to load a model with tensor parallelism inside a notebook?
Thank…
-
I've been using `atq.INT4_AWQ_CFG` and observing a performance drop when quantizing a Llama 70B model with tensor parallelism via `atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)`.
Quan…
-
### Model Series
Qwen2
### What are the models used?
Qwen2-57B-A14B
### What is the scenario where the problem happened?
train with transformers
### Is this a known issue?
- [X] I have followed…
-
Hi,
I've noticed that you have implemented a feature that allows overlapping computation and communication in tensor-parallel operations. This is a significant enhancement that has the potential t…
-
Would it be possible in this framework to combine pipeline parallelism with tensor parallelism or ZeRO data parallelism?