-
We have an HQQ 4-bit version of the Aria model: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
It's working great, but we need `torch.compile` support so it can run much f…
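For reference, this is roughly what the requested support would look like at the call site, assuming `model` is the already HQQ-quantized Aria model from the linked example (the helper name and compile flags are illustrative, not the actual integration):

```python
import torch
import torch.nn as nn

def compile_quantized(model: nn.Module) -> nn.Module:
    # Hypothetical helper: wrap the forward pass of an already-quantized model
    # with torch.compile. "reduce-overhead" enables CUDA graphs, which mainly
    # speeds up the token-by-token decode loop.
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=False)
    return model
```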
-
### Describe the bug
I've gone through all the steps to install Sora, but at the last step, running gradio/app.py, it fails about 2/3 of the way through. It hangs on loading shards at 0%, and then I get the follow…
-
@masajiro Hello, you are the wise sensei of vector search whose NGT tops popular HNSW-based engines on benchmarks. I am curious whether you think this approach can work to limit the amount of RAM needed. Also a good n…
-
Trying to quantize, but no model is generated.
My hardware is AMD.
```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
Quantizing model weights f…
```
-
I have fine-tuned Llama 3.1 using Unsloth. Then I merged and unloaded the LoRA model and pushed it to the Hub.
Now, when I tried quantizing it using:
```
from awq import AutoAWQForCausalLM
qua…
```
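For context, a typical AutoAWQ quantization flow looks roughly like the sketch below; the paths and `quant_config` values are illustrative, not the exact script from this report:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-hub-user/llama-3.1-merged"   # hypothetical merged-model repo
quant_path = "llama-3.1-awq"                    # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration/quantization, then save the quantized checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```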
-
Accuracy for the normal resnet50.onnx model comes out above 70%, but after quantizing it, accuracy drops to 0.10%. What could be the issue?
Any help would be appreciated.
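In case it helps narrow things down, this is roughly what an ONNX Runtime static-quantization setup looks like, assuming that is the path taken here; the calibration reader below feeds random data and is only a placeholder, since unrepresentative or mis-preprocessed calibration data is a common cause of this kind of accuracy collapse:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class DummyCalibrationReader(CalibrationDataReader):
    """Placeholder reader feeding random NCHW batches. In practice, yield real
    validation images preprocessed exactly as at float inference time."""
    def __init__(self, input_name: str = "input", n_batches: int = 8):
        self._batches = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(n_batches)]
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "resnet50.onnx",           # hypothetical input path
    "resnet50_int8.onnx",      # hypothetical output path
    DummyCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    per_channel=True,          # per-channel weights usually preserve accuracy better
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```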
-
### Describe the feature request
Support for quantizing and running quantized models in 4-bit, 2-bit, and 1-bit. Also, saving and loading these models in ONNX format for smaller file sizes.
The GPU doesn…
-
### Your current environment
...
### How would you like to use vllm
I have downloaded a model. Now, on my 4-GPU instance, I attempt to quantize it using AutoAWQ.
Whenever I run the script below, I ge…
-
Quantizing the KV cache in LLM inference is a common way to boost performance. I noticed that FA now supports a paged KV cache. Should we support an fp8 or int8 KV cache?
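For illustration, per-token fp8 (e4m3) quantization of a K or V tensor can be sketched as below; this is only a conceptual outline in PyTorch, not FA's paged-KV kernels:

```python
import torch

def quantize_kv_fp8(kv: torch.Tensor):
    # kv: [batch, heads, seq_len, head_dim]; one scale per token so the
    # largest entry in each head_dim vector maps to fp8 e4m3's max (~448).
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 448.0
    return (kv / scale).to(torch.float8_e4m3fn), scale

def dequantize_kv_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize back to the scale's dtype before (or fused into) attention.
    return q.to(scale.dtype) * scale
```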
-
Where in the codebase might I find the basic arithmetic / steps for quantizing with NF4?
I’ve had trouble finding a clear definition of the math in existing tutorials, but based on what I see in th…
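For what it's worth, the arithmetic itself is small enough to sketch: NF4 splits the weights into fixed-size blocks, divides each block by its absmax, and maps each scaled value to the nearest of 16 fixed code values (the normal-distribution quantiles from the QLoRA paper). A rough PyTorch sketch, not the actual bitsandbytes kernels:

```python
import torch

# The 16 NF4 code values from the QLoRA paper, rounded here to 4 decimals;
# bitsandbytes stores them at full precision.
NF4_CODES = torch.tensor([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w: torch.Tensor, block_size: int = 64):
    # Assumes w.numel() is divisible by block_size.
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scaled = blocks / absmax                                    # now in [-1, 1]
    idx = (scaled.unsqueeze(-1) - NF4_CODES).abs().argmin(-1)   # nearest code index
    return idx.to(torch.uint8), absmax                          # 4-bit indices + per-block scales

def nf4_dequantize(idx: torch.Tensor, absmax: torch.Tensor, shape):
    return (NF4_CODES[idx.long()] * absmax).reshape(shape)
```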