-
### What happened?
llama.cpp produces garbled output on the 310P3 when using Qwen2.5-7b-f16.gguf
### Name and Version
./build/bin/llama-cli -m Qwen2.5-7b-f16.gguf -p "who are you" -ngl 32 -fa
### What operating system are you seeing the …
-
**Description**
HuggingFace's Quanto has implemented 4-bit and 2-bit KV cache quantization compatible with Transformers. See: https://huggingface.co/blog/kv-cache-quantization
I may PR when I've t…
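For reference, a minimal sketch of how the quantized KV cache from that blog post is driven through `generate()` in Transformers; the model id, prompt, and generation settings below are placeholders rather than part of the original request:

```python
# Minimal sketch, assuming transformers plus the quanto package are installed.
# Model id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# cache_implementation="quantized" swaps the default KV cache for a quantized one;
# cache_config picks the quanto backend and the bit width (nbits=4 or nbits=2).
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```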
-
# TensorRT Model Optimizer - Product Roadmap
[TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt)’s north star is to be the best-in-class model optimization toolki…
-
- [ ] FP8 KV-cache
- [ ] KV-cache prefix reuse
- [ ] Grammar-constrained decoding speedup
- [ ] `torch.compile`-like speedups
- [ ] Simple one-liner `pip install`
- [ ] Multi-LoRA support (LoRAX-style)
…
-
### What happened?
I would like to begin by expressing my sincere gratitude to the authors for their dedication and effort in developing this work.
To provide context for the issue I am encounter…
-
### System Info
- X86_64
- RAM: 30 GB
- GPU: A10G, VRAM: 23GB
- Lib: Tensorrt-LLM v0.9.0
- Container Used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- Model used: Mistral 7B
### …
-
Hi @lamikr,
I built rocm_sdk_builder on a freshly installed Ubuntu 24.04.1. It took 5 hours, 120 GB of storage, and many hours of fixing small issues while building the repo (reference: https://gith…
-
SUMMARY:
- [x] Avoid full pass through the model for quantization modifier
- [x] Data free `oneshot`
- [x] Runtime of GPTQ with large models – how to do a 70B model?
- [x] Runtime of GPTQ with act…
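The checklist above refers to a specific quantization-modifier/oneshot workflow that isn't shown here; purely as a generic illustration of why GPTQ runtime is a concern at 70B scale, here is a sketch using the Transformers/Optimum GPTQ integration instead (model id and calibration dataset are placeholder choices):

```python
# Generic one-shot GPTQ sketch via the Transformers/Optimum integration, not the
# library this checklist tracks. GPTQ needs a calibration forward pass over sample
# data, which is the runtime concern for 70B-scale models noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small placeholder; a 70B checkpoint makes this pass far slower
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # per-group quantization scales
    dataset="c4",    # calibration data; the forward pass over it dominates runtime
    tokenizer=tokenizer,
)

# Loading with quantization_config runs the one-shot GPTQ calibration on the fly.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("opt-125m-gptq-4bit")
```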
-
hi, I found that trt-llm KV cache quantization leads to serious model accuracy loss, while vllm and lmdeploy show only a small loss.
- model: qwen1.5-7b
- evalset: cmmlu
![Image](https://github.com/user-attachments/asset…
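For context, this is roughly how KV-cache quantization gets switched on in the two baselines mentioned; the exact flags and versions behind the comparison above aren't stated, so the model path and dtypes here are assumptions:

```python
# Hedged sketch of enabling KV-cache quantization in the two baselines named above.
# The exact settings behind the reported comparison are unknown; model path and
# dtypes are assumptions.

# vLLM: quantize the KV cache to FP8 via kv_cache_dtype.
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-7B-Chat", kv_cache_dtype="fp8")
out = llm.generate(["who are you"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)

# LMDeploy: quant_policy=8 enables int8 KV cache (quant_policy=4 selects int4).
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("Qwen/Qwen1.5-7B-Chat", backend_config=TurbomindEngineConfig(quant_policy=8))
print(pipe(["who are you"])[0].text)
```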
-
### Description
model https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/qwen2.5-coder-7b-instruct-q5_k_m.gguf
Generating on the GPU using the source code
Run the source code example LLama Ker…