-
As mentioned in the README: "Note that due to the limitations of AutoGPTQ kernels, the real quantization of weight-only quantization can only lead to memory reduction, but with slower inference speed."
I'm …
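Below is a minimal sketch of what that trade-off looks like in practice, assuming the usual AutoGPTQ path through Hugging Face `transformers`; the checkpoint name is just an example, not taken from this issue:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example 4-bit GPTQ checkpoint; substitute the model actually being discussed.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Weights are stored packed (e.g. 4-bit), so GPU memory use drops, but the
# AutoGPTQ kernels dequantize on the fly, which is why inference is typically
# slower than running the same model in fp16.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```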
-
### Description
When attempting to load a GitHub repo into long-term memory, after it reads the repo and saves it to collections, it doesn't get all the files; somewhere along the way it crashes.
Logs:
```
b" Runni…
-
Request: maybe add a way to select a 2048- or 4096-sample length for making open hi-hats?
-
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8
https://github.com/vllm-project/llm-compressor/tre…
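For reference, the FP8 W8A8 flow in those llm-compressor examples is roughly the one-shot recipe below; the exact import paths, scheme name, and save arguments are from memory and may differ between versions, so treat this as a sketch rather than the canonical example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize weights and activations of all Linear layers to FP8, skipping lm_head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8 dynamic activation quantization is data-free, so no calibration set is needed.
oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.1-8B-Instruct-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```
The resulting checkpoint can then be served directly with vLLM.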
-
Does MiniCPM-V 2.6 currently support int8/FP8 quantization?
Thanks!
-
### System Info
Ubuntu 20.04
NVIDIA A100
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 and 24.07
TensorRT-LLM v0.14.0 and v0.11.0
### Who can help?
@Tracin
### Information
- [x] The offici…
-
Hi everyone,
I'm trying to quantize the YOLOv5n model from [here](https://github.com/ultralytics/yolov5). I'm using the Vitis-AI v3.0 Docker image with the following code:
```
import pytorch_nndct
i…
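# ---------------------------------------------------------------------------
# The snippet above is truncated; for context, a typical pytorch_nndct
# post-training quantization flow looks roughly like the sketch below. This is
# a hedged reconstruction, not the reporter's code: the input shape, output
# directory, and calibration step are assumptions for illustration only.
# ---------------------------------------------------------------------------
import torch
from pytorch_nndct.apis import torch_quantizer

model = torch.hub.load("ultralytics/yolov5", "yolov5n", pretrained=True).eval()
dummy_input = torch.randn(1, 3, 640, 640)

# 'calib' mode inserts fake-quant nodes and collects activation statistics.
quantizer = torch_quantizer("calib", model, (dummy_input,), output_dir="quantize_result")
quant_model = quantizer.quant_model

# Run representative images through quant_model here to calibrate; a single
# dummy tensor is used only to keep the sketch self-contained.
quant_model(dummy_input)

# Export calibration results; re-run with quant_mode="test" to evaluate and
# then export the deployable xmodel.
quantizer.export_quant_config()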
-
Dear @kimishpatel @jerryzh168 @shewu-quic
I want to split a model (e.g., Llama-3.2-3B) into multiple layers and apply different quantization settings (qnn_8a8w, qnn_16a4w, ...) to each layer.
Has such…
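For what it's worth, here is a rough sketch of the general per-module mechanism in the PT2E quantization flow, using XNNPACKQuantizer only as a stand-in; whether the QNN quantizer's qnn_8a8w / qnn_16a4w presets can be attached per module the same way is exactly the open question here, so the class and method names below are assumptions, not a confirmed answer:
```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class TwoLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer0 = torch.nn.Linear(16, 16)
        self.layer1 = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.layer1(self.layer0(x))

model = TwoLayer().eval()
example_inputs = (torch.randn(1, 16),)

quantizer = XNNPACKQuantizer()
# Default config for the whole model (analogous to using qnn_8a8w for most layers)...
quantizer.set_global(get_symmetric_quantization_config())
# ...and a different config for one named submodule (analogous to qnn_16a4w).
quantizer.set_module_name("layer1", get_symmetric_quantization_config(is_per_channel=True))

# The export API name varies with the PyTorch version (capture_pre_autograd_graph
# in older releases, export_for_training in newer ones).
exported = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibration pass
quantized = convert_pt2e(prepared)
```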
-
### System Info
GPU: 4090
Tensorrt: 10.3
tensorrt-llm: 0.13.0.dev2024081300
### Who can help?
@Tracin Could you please have a look? Thank you very much.
### Information
- [ ] The official example sc…
-
### System Info
NVIDIA 4090
TensorRT-0.7.1
In nvidia-ammo, it appears these lines in `ammo/torch/export/layer_utils.py` fail unexpectedly for some Llama variants:
In particular, the…