-
### Proposal to improve performance
Improve bitsandbytes quantization inference speed
### Report of performance regression
I'm testing llama-3.2-1b on a toy dataset. For offline inference using the…
-
TensorRT-LLM has great potential for allowing people to run larger models efficiently with limited hardware resources. Unfortunately, the current quantization workflow requires significant computation…
-
### Description of the bug:
Cannot convert TinyLlama to a fully int8-quantized tflite model
### Actual vs expected behavior:
The compute platform only supports the int8 datatype; request for tflite full…
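Full-integer quantization of this kind comes down to mapping each float tensor onto the int8 range with a per-tensor scale and zero point. A minimal sketch of that mapping, in plain Python with illustrative values (real converters such as TFLite calibrate the float range from a representative dataset):

```python
# Sketch of asymmetric int8 quantization: map a float range [rmin, rmax]
# onto the int8 range [-128, 127] via a scale and a zero point.
# Illustrative only; not tied to any particular converter.

QMIN, QMAX = -128, 127

def quant_params(rmin: float, rmax: float):
    """Compute scale and zero point so that 0.0 is exactly representable."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # range must include zero
    scale = (rmax - rmin) / (QMAX - QMIN)
    zero_point = round(QMIN - rmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zp: int) -> int:
    return max(QMIN, min(QMAX, round(x / scale) + zp))

def dequantize(q: int, scale: float, zp: int) -> float:
    return (q - zp) * scale

scale, zp = quant_params(-1.0, 3.0)   # hypothetical calibrated range
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)          # recovers 0.5 to within one step
```

The roundtrip error is bounded by `scale`, and values outside the calibrated range saturate to the int8 limits, which is why calibration data matters for full-int8 models.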
-
### Your current environment
`VLLM 0.6.1.post2`
### 🐛 Describe the bug
I used a model from a hub with AWQ quantization, so it's already quantized. I loaded it with a half data type, and it pe…
-
### 🚀 Feature request
Quantization is a widely used technique to accelerate models, particularly when using the [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.htm…
-
Hi everyone,
I’m working on a project that involves deploying a YOLOv10 model on a mobile/edge device. To improve inference speed and reduce the model size, I want to convert my YOLOv10 model to Te…
-
### Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the [issues policy](https://www.github.com/mlflow/mlflow/blob/master/ISSUE_POLICY.md)
### W…
-
**Describe the bug**
This is a minor issue, but I think the quantization configuration in the file [`examples/quantization_24_sparse_w4a16/2:4_w4a16_group-128_recipe.yaml`](https://github.com/vllm-pr…
-
### Motivation
The model used in our business (Internvl 2-26B) outputs very few tokens (1–2) after prompt optimization, so inference is effectively prefill-only. Therefore, we hope to use W8A8 qu…
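W8A8 means both weights and activations are held in int8, so the matmuls run on integer units and only the accumulator stays wider. A toy sketch of that idea in plain Python, using symmetric per-tensor scales and made-up values (real W8A8 kernels use per-channel scales and fused dequantization):

```python
# Toy W8A8 dot product: symmetric int8 quantization for both weights
# and activations, integer accumulation, one dequantize at the end.
# Illustrative only; scales below are hypothetical.

def sym_quant(vec, scale):
    """Symmetric int8 quantization: q = clamp(round(x / scale), -127, 127)."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def w8a8_dot(acts, weights, a_scale, w_scale):
    qa = sym_quant(acts, a_scale)
    qw = sym_quant(weights, w_scale)
    acc = sum(a * w for a, w in zip(qa, qw))  # int32-style accumulator
    return acc * a_scale * w_scale            # dequantize once, at the end

acts = [0.5, -1.0, 0.25]
weights = [0.1, 0.2, -0.4]
out = w8a8_dot(acts, weights, a_scale=1.0 / 127, w_scale=0.4 / 127)
ref = sum(a * w for a, w in zip(acts, weights))  # float reference
```

Because the whole dot product runs in integers, prefill-heavy workloads like this one get the compute savings on every token of the prompt, which is where W8A8 pays off.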
-
### Jan version
v0.5.6
### Describe the Bug
I am experiencing an issue uploading image files to the multimodal model "llava-v1.5-13b-Q2_K.gguf". The model only accepts PDF documents for upload, pre…