Closed: siahuat0727 closed this issue 3 months ago.
TensorRT-LLM does not support weight-only quantization on T4; T4 is not in TensorRT-LLM's support list. Could you try on Ampere or Hopper GPUs?
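For reference, a quick way to confirm the architecture: T4 is Turing with compute capability 7.5, while Ampere starts at 8.0 and Hopper at 9.0. A minimal check (the `compute_cap` query field assumes a reasonably recent driver/nvidia-smi):

```bash
# Print GPU name and compute capability; a T4 reports 7.5 (Turing),
# Ampere GPUs report 8.x, Hopper reports 9.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader

# Equivalent check via PyTorch, if it is installed:
python -c "import torch; print(torch.cuda.get_device_capability())"
```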
I met the same problem. Could you give some advice on how to handle it, other than changing the GPU architecture?
Weight-only quantization is not supported on T4, so if you want to run on T4, you can only use float16.
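A minimal sketch of that float16-only path, assuming the directory layout from the bloom example README (the model and output paths are placeholders):

```bash
# Convert the HF checkpoint without the weight-only flags (plain float16)
python convert_checkpoint.py --model_dir ./bloom/560M \
    --dtype float16 \
    --output_dir ./bloom/560M/trt_ckpt/fp16

# Build a float16 engine from the converted checkpoint
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16 \
    --output_dir ./bloom/560M/trt_engines/fp16
```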
System Info
Google Colab with a T4 GPU and CUDA 12.2. TensorRT-LLM version: 0.9.0.dev2024040200. Here is the minimal reproducible notebook on Google Colab.
Who can help?
@Tracin @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The official example for bloom, but with the additional flags `--use_weight_only --weight_only_precision int4`. It works if `convert_checkpoint.py` is run without the flags `--use_weight_only --weight_only_precision int4`. I added these two flags because I wanted to test the latency of weight-only quantization on this task; a sketch of the full command sequence is below.
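For concreteness, here is a sketch of the failing sequence, assuming the invocation and directory layout from the bloom example README (the model and output paths are placeholders):

```bash
# Convert the HF checkpoint with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./bloom/560M \
    --dtype float16 \
    --use_weight_only --weight_only_precision int4 \
    --output_dir ./bloom/560M/trt_ckpt/int4

# Build the TensorRT engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/int4 \
    --output_dir ./bloom/560M/trt_engines/int4

# Run the summarization test against the built engine
python ../summarize.py --test_trt_llm \
    --hf_model_dir ./bloom/560M \
    --engine_dir ./bloom/560M/trt_engines/int4
```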
Expected behavior
I expected `examples/summarize.py` to pass with the quantized TRT engine.

Actual behavior
Additional notes
Here is the minimal reproducible notebook on Google Colab. I'm wondering whether it simply doesn't make sense to add these weight-only quantization flags when testing summarize.py, whether the T4 GPU doesn't support the functionality I'm testing, or whether this is some other environment issue. I'd really appreciate your feedback and guidance, thank you.