NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

SmoothQuant test of llava error #1206

Open dongxuemin666 opened 4 months ago

dongxuemin666 commented 4 months ago

System Info

Linux

Who can help?

No response

Information

Tasks

Reproduction

1. Use SmoothQuant to quantize the llava model.
2. Use "INT8 KV cache + per-channel weight-only" to quantize the llava model.

Expected behavior

Run through successfully

actual behavior

The following error is encountered: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
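
For context, this is a generic PyTorch error rather than a TensorRT-LLM-specific one: PyTorch ships no fp16 (Half) addmm kernel for the CPU backend, so any half-precision Linear forward executed on the CPU raises it. A minimal standalone repro, independent of TensorRT-LLM:

```python
import torch

# PyTorch has no Half (fp16) addmm kernel for CPU, so a Linear forward
# on CPU fp16 tensors raises:
#   RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
layer = torch.nn.Linear(8, 8).half()
x = torch.randn(1, 8, dtype=torch.float16)
layer(x)  # raises RuntimeError on CPU
```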

additional notes

Are these methods not supported for llava?

Tracin commented 3 months ago

Hi, could you share your command, please?

felixslu commented 3 months ago

I have the same problem!

TensorRT-LLM v0.8.0, using "INT8 KV cache + per-channel weight-only" for llama-7B.

felixslu commented 3 months ago

I figured out the problem: you should check which device your model is running on, CPU or GPU.
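
That matches the trace above: the fp16 checkpoint is being run on the CPU during calibration. A minimal sketch of the device check, with a plain nn.Linear standing in for the model being quantized:

```python
import torch

layer = torch.nn.Linear(8, 8).half()
x = torch.randn(1, 8, dtype=torch.float16)

if torch.cuda.is_available():
    # fp16 addmm kernels exist on CUDA, so moving both the module and
    # its inputs to the GPU makes the forward pass succeed.
    layer, x = layer.cuda(), x.cuda()
else:
    # Without a GPU, fall back to fp32, which does have a CPU kernel.
    layer, x = layer.float(), x.float()

print(layer(x).shape)  # torch.Size([1, 8])
```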