-
Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimiz…
-
### Description of the bug:
I tried running the example.py script provided for the quantization example, but for Llama. Wherever a reference to Gemma was made, I substituted the appropriate reference to Llama. The…
-
### Checklist
- [x] 1. If the issue you raised is not a feature request but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.…
-
### Your current environment
```text
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Home
GCC vers…
```
-
When I quantized the Qwen2.5-1.5B-instruct model following **"Quantizing the GGUF with AWQ Scale"** in the [docs](https://qwen.readthedocs.io/en/latest/quantization/llama.cpp.html), it showed that th…
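For reference, the flow that section describes (applying AWQ scales without packing the weights, so the checkpoint can still be converted to GGUF) can be sketched with AutoAWQ roughly as follows; the output paths and quant config here are illustrative assumptions, not the exact values from the docs:

```python
# Rough sketch of the "AWQ scale" step, assuming the AutoAWQ API.
# Paths and quant_config values are illustrative placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-1.5B-Instruct"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

# export_compatible=True applies the AWQ scales but skips weight packing,
# leaving an fp16 checkpoint that llama.cpp's converter can read.
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized("Qwen2.5-1.5B-Instruct-awq-scaled")
tokenizer.save_pretrained("Qwen2.5-1.5B-Instruct-awq-scaled")
```

The scaled checkpoint is then converted and quantized with llama.cpp's usual convert-hf-to-gguf.py and llama-quantize tools, per the linked docs.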
-
I am trying to execute a test script on WSL2 (Ubuntu 22.04) and am getting this issue:
```
from .autonotebook import tqdm as notebook_tqdm
2024-11-30 22:10:56,948 INFO util.py:154 -- Missing packages: ['ip…
```
-
When I quantized the Qwen2.5-1.5B-instruct model following "GGUF Export" in examples.md in the docs, it showed that the quantization was complete and I obtained the GGUF model. But when I load …
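For context, a minimal smoke test of an exported GGUF with llama-cpp-python might look like the sketch below; the file name and prompt are placeholders, not taken from the report:

```python
# Hypothetical smoke test for an exported GGUF file via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-1.5b-instruct.gguf", n_ctx=2048)  # placeholder path
out = llm("Hello, who are you?", max_tokens=64)
print(out["choices"][0]["text"])
```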
-
### 🚀 The feature, motivation and pitch
Last week AMD announced ROCm 6.2 (https://rocm.docs.amd.com/en/latest/about/release-notes.html), also announcing expanded support for vLLM and FP8.
Actuall…
-
### What happened?
I want to quantize the KV cache to q8_0, but the following error occurs:
llama_new_context_with_model: V cache quantization requires flash_attn
common_init_from_params…
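The message itself points at the fix: llama.cpp only supports quantizing the V cache when flash attention is enabled. A minimal sketch via llama-cpp-python, assuming a build with flash-attention support and a placeholder model path:

```python
# Sketch: a quantized V cache requires flash attention to be enabled.
# The model path is a placeholder.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",          # placeholder
    flash_attn=True,                  # required for V-cache quantization
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # K cache as q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # V cache as q8_0
)
```

On the llama.cpp command line, the equivalent is passing -fa together with -ctk q8_0 and -ctv q8_0.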
-
### Motivation
In current large-model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Recently, severa…