NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

"No valid weight only groupwise GEMM tactic" error during inference #1235

Open palVikram opened 6 months ago

palVikram commented 6 months ago

System Info

Who can help?

@Tracin @juney-nvidia @byshiue

Information

Tasks

Reproduction

Steps:

Step 1: Build a Docker image from this Dockerfile: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi

Step 2: docker run --gpus all -it --rm

Step 3: Inside Docker container bash:

a. Installed Git LFS.

b. Cloned the model and pulled the weights:
   git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
   git lfs pull

c. Quantized the model using this command (a sketch for inspecting the produced checkpoint follows these steps):
   python ../quantization/quantize.py --model_dir ./Mistral-7B-Instruct-v0.2 \
       --dtype float16 \
       --qformat int4_awq \
       --awq_block_size 128 \
       --output_dir ./quantized_int4-awq \
       --calib_size 32

d. Built the TensorRT engine:
   trtllm-build --checkpoint_dir ./quantized_int4-awq \
       --output_dir ./mistral_trt_engine/ \
       --gemm_plugin float16

e. Ran the Mistral TensorRT engine:
   python3 run.py --max_output_len=50 \
       --tokenizer_dir ./Mistral-7B-Instruct-v0.2 \
       --engine_dir=./mistral_trt_engine/ \
       --max_attention_window_size=4096
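Before building the engine, it can help to confirm that the quantized checkpoint actually carries the expected int4-AWQ metadata. The snippet below is a minimal sketch, assuming the directory produced by quantize.py contains a config.json with a quantization section (exact field names may differ between TensorRT-LLM versions); it only prints whatever is there so it can be compared against the flags used above.

```python
import json
from pathlib import Path

# Path assumed from the reproduction steps above.
ckpt_dir = Path("./quantized_int4-awq")

# quantize.py writes a config.json next to the weight shards; print its
# quantization-related fields so the quant algorithm / group size can be
# checked against the command-line flags (int4_awq, block size 128).
config = json.loads((ckpt_dir / "config.json").read_text())

print("quantization section:",
      json.dumps(config.get("quantization", {}), indent=2))
print("dtype:", config.get("dtype"))
```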

Expected behavior

I can successfully build a TensorRT engine from the Hugging Face Mistral model clone inside the Docker container. However, when I run it, I hit the error message 'No valid weight only groupwise GEMM tactic.'

Actual behavior

Error screenshot attached (image showing the 'No valid weight only groupwise GEMM tactic' runtime error).

Additional notes

@Tracin am I missing any step during quantization that is causing this error?

palVikram commented 6 months ago

I also tried the "mistralai/Mistral-7B-v0.1" model, following the same steps above, and got the same error.

Tracin commented 6 months ago

@Barry-Delaney I think we are going to support the weight-only groupwise kernel for SM70, am I correct?

Barry-Delaney commented 6 months ago

@Tracin currently we don't have such a plan on our roadmap.
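For anyone hitting this on older hardware: the exchange above suggests the weight-only groupwise GEMM kernels depend on the GPU's SM (compute capability) version, and SM70 (Volta) is not covered. A minimal sketch to check which SM version the container actually sees, assuming PyTorch is available in the image (it is a TensorRT-LLM dependency):

```python
import torch

# Report the compute capability of each visible GPU; weight-only groupwise
# GEMM support depends on the SM architecture (e.g. SM70 = Volta,
# SM80 = Ampere, SM90 = Hopper).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> SM{major}{minor}")
```

On recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv` should report the same information from the shell.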

salaki commented 3 weeks ago

I got the same error on an H100 when running inference with an int4-quantized engine. @Barry-Delaney

Barry-Delaney commented 3 weeks ago

@salaki are you using a customized model? Could you please provide more information about GEMM-related params in your model?
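Not an official checklist, but one quick way to gather the GEMM-related parameters asked about above is to read them from the model's Hugging Face config.json. The sketch below assumes standard Mistral/Llama-style field names and checks whether the attention and MLP GEMM K dimensions are multiples of the AWQ block size (128 in the reproduction above); treat the divisibility check as an assumption about the groupwise kernels, not a documented constraint.

```python
import json
from pathlib import Path

# Hypothetical local path to the Hugging Face model clone.
model_dir = Path("./Mistral-7B-Instruct-v0.2")
group_size = 128  # --awq_block_size used during quantization

cfg = json.loads((model_dir / "config.json").read_text())

hidden = cfg["hidden_size"]
inter = cfg["intermediate_size"]
heads = cfg["num_attention_heads"]
kv_heads = cfg.get("num_key_value_heads", heads)
head_dim = hidden // heads

print(f"hidden_size={hidden}, intermediate_size={inter}, "
      f"heads={heads}, kv_heads={kv_heads}, head_dim={head_dim}")

# GEMM K dimensions that feed the weight-only groupwise kernels.
for name, k in [("attention qkv/out projections (K=hidden_size)", hidden),
                ("mlp gate/up projections (K=hidden_size)", hidden),
                ("mlp down projection (K=intermediate_size)", inter)]:
    ok = k % group_size == 0
    print(f"{name}: {k} {'is' if ok else 'is NOT'} "
          f"divisible by group_size={group_size}")
```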