-
### 🚀 The feature, motivation and pitch
Any QLoRA adapters trained on large checkpoints (e.g., 70B) are unusable as we cannot use TP>1 to shard the model over multiple GPUs. Therefore, resolving this…
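For concreteness, the usage this request would enable looks roughly like the sketch below; the model name, adapter path, and TP size are placeholders, not taken from a working setup:
```python
# Minimal sketch of the desired call, assuming vLLM's offline LoRA API;
# the model name, adapter path, and tensor_parallel_size are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # large base checkpoint
    enable_lora=True,
    tensor_parallel_size=4,             # TP>1 is exactly what is blocked today
)

out = llm.generate(
    "Hello",
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my_qlora", 1, "/path/to/qlora_adapter"),
)
print(out[0].outputs[0].text)
```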
-
### Your current environment
I am trying out FP8 support on AMD GPUs (MI250, MI300) and the vLLM library does not seem to support AMD GPUs yet for FP8 quantization. Is there any timeline for when thi…
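For reference, what I am running is roughly the following sketch; the model name is a placeholder, and `quantization="fp8"` is the option vLLM exposes on CUDA builds:
```python
# Rough sketch of the attempted FP8 load on ROCm; the model name is a
# placeholder and quantization="fp8" is the option vLLM exposes on CUDA.
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```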
-
### Motivation
I wanted to deploy Mixtral8x22B with quantization, but it says that lmdeploy doesn't support the Mixtral8x22B model.
### Related resources
_No response_
### Additional context
_N…
-
Hi, thanks for the lib! When checking https://github.com/vllm-project/llm-compressor/issues/935, it seems that `one_shot` auto-saves everything to the output folder. That looks great, but if I understa…
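For context, the pattern I have in mind is roughly the following; the recipe, dataset, and `output_dir` values are placeholders, and the import path may differ between versions:
```python
# Hedged sketch of the one-shot flow being discussed; recipe, dataset, and
# output_dir values are placeholders. The import path may vary across versions
# (older releases expose it under llmcompressor.transformers).
from llmcompressor import oneshot

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe="recipe.yaml",            # e.g. a GPTQ / W8A8 recipe
    output_dir="./compressed-out",   # where everything gets auto-saved
    max_seq_length=2048,
    num_calibration_samples=512,
)
```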
-
The part that's a bit confusing is dynamic indexing. For consistency, the underlying integers still need to be scaled before they can be used as indexing integers.
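As a toy illustration of what I mean (NumPy, with made-up numbers), the stored integers carry a quantization scale and have to be rescaled into real index values before the gather:
```python
import numpy as np

# Toy illustration with made-up numbers: the stored int8 values are quantized
# with a scale and must be rescaled before they can serve as gather indices.
scale = 4
q_idx = np.array([0, 1, 2, 3], dtype=np.int8)   # quantized representation
idx = q_idx.astype(np.int64) * scale            # rescale -> [0, 4, 8, 12]

table = np.arange(16) * 10
print(table[idx])                               # [  0  40  80 120]
```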
-
I saw that it compiled; it can give about a 20% performance increase on Flux, but it seems to have no effect on CogVideo 1.5.
The quantization is FP8 and faster cache is enabled.
-
### System Info
SERVER:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
PRETTY_NAME:"Debian GNU/Linux 11 (bullseye)"
python:3.11.5
conda:23.10.0
torch:2.4.1+cpu
### Running Xinference with D…
-
### Your current environment
vllm==0.6.3.post1
### Model Input Dumps
```bash
ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_pa…
```
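For what it's worth, a quick divisibility check (assuming the unsharded dimension is 2 × 10944 = 21888 and the failing run used tensor_parallel_size=2) shows why reducing TP is suggested:
```python
# Assumes the unsharded weight dimension is 2 * 10944 = 21888 and the failing
# run used tensor_parallel_size=2; only TP=1 satisfies min_thread_k=128 here.
full_size, min_thread_k = 2 * 10944, 128
for tp in (1, 2, 4):
    per_partition = full_size // tp
    ok = per_partition % min_thread_k == 0
    print(f"tp={tp}: {per_partition} % {min_thread_k} == 0 -> {ok}")
# tp=1 -> True, tp=2 -> False (10944, the value in the error), tp=4 -> False
```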
-
TensorRT-LLM has great potential for allowing people to run larger models efficiently with limited hardware resources. Unfortunately, the current quantization workflow requires significant computation…
-
**Bug description.**
When trying to pull a specific quantization tag for a model through Ollama, I was getting the following error: `The specified tag is not a valid quantization scheme.`
At first …