NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

"No valid weight only groupwise GEMM tactic" error during inference #1235

Open palVikram opened 6 months ago

palVikram commented 6 months ago

System Info

Who can help?

@Tracin @juney-nvidia @byshiue

Information

Tasks

Reproduction

Steps:

Step 1: Build a Docker image from this Dockerfile: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi

Step 2: docker run --gpus all -it --rm

Step 3: Inside Docker container bash:

a. Installed Git LFS.

b. Cloned the model and pulled the weights:
   git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
   git lfs pull

c. Quantized the model using this command (a sketch for inspecting the produced checkpoint follows these steps):
   python ../quantization/quantize.py --model_dir ./Mistral-7B-Instruct-v0.2 \
       --dtype float16 \
       --qformat int4_awq \
       --awq_block_size 128 \
       --output_dir ./quantized_int4-awq \
       --calib_size 32

d. Built the TensorRT engine:
   trtllm-build --checkpoint_dir ./quantized_int4-awq \
       --output_dir ./mistral_trt_engine/ \
       --gemm_plugin float16

e. Ran the Mistral TensorRT engine:
   python3 run.py --max_output_len=50 \
       --tokenizer_dir ./Mistral-7B-Instruct-v0.2 \
       --engine_dir=./mistral_trt_engine/ \
       --max_attention_window_size=4096
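Before building the engine, it can help to confirm that the quantized checkpoint actually carries the expected int4-AWQ metadata. The snippet below is a minimal sketch, assuming the directory produced by quantize.py contains a config.json with a quantization section (exact field names may differ between TensorRT-LLM versions); it only prints whatever is there so it can be compared against the flags used above.

```python
import json
from pathlib import Path

# Path assumed from the reproduction steps above.
ckpt_dir = Path("./quantized_int4-awq")

# quantize.py writes a config.json next to the weight shards; print its
# quantization-related fields so the quant algorithm / group size can be
# checked against the command-line flags (int4_awq, block size 128).
config = json.loads((ckpt_dir / "config.json").read_text())

print("quantization section:",
      json.dumps(config.get("quantization", {}), indent=2))
print("dtype:", config.get("dtype"))
```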

Expected behavior

I can successfully build a TensorRT engine from the Hugging Face Mistral model clone inside the Docker container. However, when I run it, I hit the error message 'No valid weight only groupwise GEMM tactic.'

Actual behavior

Error screenshot attached (image showing the 'No valid weight only groupwise GEMM tactic' runtime error).

Additional notes

@Tracin am I missing any step during quantization that is causing this error?

palVikram commented 6 months ago

I also tried the "mistralai/Mistral-7B-v0.1" model, following the same steps above, and got the same error.

Tracin commented 6 months ago

@Barry-Delaney I think we are going to support the weight-only groupwise kernel for SM70, am I correct?

Barry-Delaney commented 6 months ago

@Tracin currently we don't have such a plan on our roadmap.
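For anyone hitting this on older hardware: the exchange above suggests the weight-only groupwise GEMM kernels depend on the GPU's SM (compute capability) version, and SM70 (Volta) is not covered. A minimal sketch to check which SM version the container actually sees, assuming PyTorch is available in the image (it is a TensorRT-LLM dependency):

```python
import torch

# Report the compute capability of each visible GPU; weight-only groupwise
# GEMM support depends on the SM architecture (e.g. SM70 = Volta,
# SM80 = Ampere, SM90 = Hopper).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> SM{major}{minor}")
```

On recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv` should report the same information from the shell.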

salaki commented 3 weeks ago

I got the same error on an H100 when running inference with an int4-quantized engine. @Barry-Delaney

Barry-Delaney commented 3 weeks ago

@salaki are you using a customized model? Could you please provide more information about GEMM-related params in your model?
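Not an official checklist, but one quick way to gather the GEMM-related parameters asked about above is to read them from the model's Hugging Face config.json. The sketch below assumes standard Mistral/Llama-style field names and checks whether the attention and MLP GEMM K dimensions are multiples of the AWQ block size (128 in the reproduction above); treat the divisibility check as an assumption about the groupwise kernels, not a documented constraint.

```python
import json
from pathlib import Path

# Hypothetical local path to the Hugging Face model clone.
model_dir = Path("./Mistral-7B-Instruct-v0.2")
group_size = 128  # --awq_block_size used during quantization

cfg = json.loads((model_dir / "config.json").read_text())

hidden = cfg["hidden_size"]
inter = cfg["intermediate_size"]
heads = cfg["num_attention_heads"]
kv_heads = cfg.get("num_key_value_heads", heads)
head_dim = hidden // heads

print(f"hidden_size={hidden}, intermediate_size={inter}, "
      f"heads={heads}, kv_heads={kv_heads}, head_dim={head_dim}")

# GEMM K dimensions that feed the weight-only groupwise kernels.
for name, k in [("attention qkv/out projections (K=hidden_size)", hidden),
                ("mlp gate/up projections (K=hidden_size)", hidden),
                ("mlp down projection (K=intermediate_size)", inter)]:
    ok = k % group_size == 0
    print(f"{name}: {k} {'is' if ok else 'is NOT'} "
          f"divisible by group_size={group_size}")
```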