-
Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimiz…
-
### Description of the bug:
I tried running the example.py script provided for the quantization example, but for Llama. Wherever a reference to Gemma was made, I substituted the appropriate reference to Llama. The…
-
### Checklist
- [x] 1. If the issue you raised is not a feature request but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.…
-
### Your current environment
```text
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Home
GCC vers…
```
-
When I quantized the Qwen2.5-1.5B-instruct model following **"Quantizing the GGUF with AWQ Scale"** in the [docs](https://qwen.readthedocs.io/en/latest/quantization/llama.cpp.html), it showed that th…
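For reference, the flow that section describes (applying AWQ scales without packing the weights, so the checkpoint can still be converted to GGUF) can be sketched with AutoAWQ roughly as follows; the output paths and quant config here are illustrative assumptions, not the exact values from the docs:

```python
# Rough sketch of the "AWQ scale" step, assuming the AutoAWQ API.
# Paths and quant_config values are illustrative placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-1.5B-Instruct"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

# export_compatible=True applies the AWQ scales but skips weight packing,
# leaving an fp16 checkpoint that llama.cpp's converter can read.
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized("Qwen2.5-1.5B-Instruct-awq-scaled")
tokenizer.save_pretrained("Qwen2.5-1.5B-Instruct-awq-scaled")
```

The scaled checkpoint is then converted and quantized with llama.cpp's usual convert-hf-to-gguf.py and llama-quantize tools, per the linked docs.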
-
I am trying to execute a test script on WSL2 (Ubuntu 22.04) and am getting this issue:
```
from .autonotebook import tqdm as notebook_tqdm
2024-11-30 22:10:56,948 INFO util.py:154 -- Missing packages: ['ip…
```
-
When I quantized the Qwen2.5-1.5B-instruct model following "GGUF Export" in examples.md in the docs, it showed that the quantization was complete and I obtained the GGUF model. But when I load …
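For context, a minimal smoke test of an exported GGUF with llama-cpp-python might look like the sketch below; the file name and prompt are placeholders, not taken from the report:

```python
# Hypothetical smoke test for an exported GGUF file via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-1.5b-instruct.gguf", n_ctx=2048)  # placeholder path
out = llm("Hello, who are you?", max_tokens=64)
print(out["choices"][0]["text"])
```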
-
### 🚀 The feature, motivation and pitch
Last week AMD announced ROCm 6.2 (https://rocm.docs.amd.com/en/latest/about/release-notes.html), also announcing expanded support for vLLM and FP8.
Actuall…
-
### What happened?
I want to quantize the KV cache to q8_0, but the following error occurs:
llama_new_context_with_model: V cache quantization requires flash_attn
common_init_from_params…
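The message itself points at the fix: llama.cpp only supports quantizing the V cache when flash attention is enabled. A minimal sketch via llama-cpp-python, assuming a build with flash-attention support and a placeholder model path:

```python
# Sketch: a quantized V cache requires flash attention to be enabled.
# The model path is a placeholder.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",          # placeholder
    flash_attn=True,                  # required for V-cache quantization
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # K cache as q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # V cache as q8_0
)
```

On the llama.cpp command line, the equivalent is passing -fa together with -ctk q8_0 and -ctv q8_0.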
-
### Motivation
In current large-model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Recently, severa…