-
We have an HQQ 4-bit version of the Aria model: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
It's working great, but we need `torch.compile` support so it can run much f…
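For reference, this is roughly what the requested support would look like at the call site, assuming `model` is the already HQQ-quantized Aria model from the linked example (the helper name and compile flags are illustrative, not the actual integration):

```python
import torch
import torch.nn as nn

def compile_quantized(model: nn.Module) -> nn.Module:
    # Hypothetical helper: wrap the forward pass of an already-quantized model
    # with torch.compile. "reduce-overhead" enables CUDA graphs, which mainly
    # speeds up the token-by-token decode loop.
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=False)
    return model
```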
-
### Describe the bug
I've gone through all the steps to install Sora, but at the last step, running gradio/app.py, it fails about 2/3 of the way through. It hangs on loading shards at 0%, and then I get the follow…
-
@masajiro Hello, you are the wise sensei of vector search whose NGT tops popular HNSW-based engines on benchmarks. I am curious whether you think this approach can work to limit the amount of RAM needed. Also a good n…
-
Trying to quantize, but no model is generated.
My hardware is AMD.
```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
Quantizing model weights f…
```
-
I have fine-tuned Llama 3.1 using Unsloth. Then I merged and unloaded the LoRA model and pushed it to the Hub.
Now, when I tried quantizing it using:
```
from awq import AutoAWQForCausalLM
qua…
```
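For context, a typical AutoAWQ quantization flow looks roughly like the sketch below; the paths and `quant_config` values are illustrative, not the exact script from this report:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-hub-user/llama-3.1-merged"   # hypothetical merged-model repo
quant_path = "llama-3.1-awq"                    # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration/quantization, then save the quantized checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```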
-
Accuracy for the normal resnet50.onnx model comes out above 70%, but after quantizing it, accuracy drops to 0.10%. What could be the issue?
Any help would be appreciated.
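In case it helps narrow things down, this is roughly what an ONNX Runtime static-quantization setup looks like, assuming that is the path taken here; the calibration reader below feeds random data and is only a placeholder, since unrepresentative or mis-preprocessed calibration data is a common cause of this kind of accuracy collapse:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class DummyCalibrationReader(CalibrationDataReader):
    """Placeholder reader feeding random NCHW batches. In practice, yield real
    validation images preprocessed exactly as at float inference time."""
    def __init__(self, input_name: str = "input", n_batches: int = 8):
        self._batches = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(n_batches)]
        )

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "resnet50.onnx",           # hypothetical input path
    "resnet50_int8.onnx",      # hypothetical output path
    DummyCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    per_channel=True,          # per-channel weights usually preserve accuracy better
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```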
-
### Describe the feature request
Support for quantizing and running quantized models in 4-bit, 2-bit, and 1-bit. Also, saving and loading these models in ONNX format for smaller file sizes.
The GPU doesn…
-
### Your current environment
...
### How would you like to use vllm
I have downloaded a model. Now, on my 4-GPU instance, I attempt to quantize it using AutoAWQ.
Whenever I run the script below, I ge…
-
Quantizing the KV cache in LLM inference is a common way to boost performance. I noticed that FA now supports a paged KV cache. Should we support an fp8 or int8 KV cache?
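For illustration, per-token fp8 (e4m3) quantization of a K or V tensor can be sketched as below; this is only a conceptual outline in PyTorch, not FA's paged-KV kernels:

```python
import torch

def quantize_kv_fp8(kv: torch.Tensor):
    # kv: [batch, heads, seq_len, head_dim]; one scale per token so the
    # largest entry in each head_dim vector maps to fp8 e4m3's max (~448).
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 448.0
    return (kv / scale).to(torch.float8_e4m3fn), scale

def dequantize_kv_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize back to the scale's dtype before (or fused into) attention.
    return q.to(scale.dtype) * scale
```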
-
Where in the codebase might I find the basic arithmetic / steps for quantizing with NF4?
I’ve had trouble finding a clear definition of the math in existing tutorials, but based on what I see in th…
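For what it's worth, the arithmetic itself is small enough to sketch: NF4 splits the weights into fixed-size blocks, divides each block by its absmax, and maps each scaled value to the nearest of 16 fixed code values (the normal-distribution quantiles from the QLoRA paper). A rough PyTorch sketch, not the actual bitsandbytes kernels:

```python
import torch

# The 16 NF4 code values from the QLoRA paper, rounded here to 4 decimals;
# bitsandbytes stores them at full precision.
NF4_CODES = torch.tensor([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w: torch.Tensor, block_size: int = 64):
    # Assumes w.numel() is divisible by block_size.
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scaled = blocks / absmax                                    # now in [-1, 1]
    idx = (scaled.unsqueeze(-1) - NF4_CODES).abs().argmin(-1)   # nearest code index
    return idx.to(torch.uint8), absmax                          # 4-bit indices + per-block scales

def nf4_dequantize(idx: torch.Tensor, absmax: torch.Tensor, shape):
    return (NF4_CODES[idx.long()] * absmax).reshape(shape)
```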