palVikram opened 8 months ago
I also tried the "mistralai/Mistral-7B-v0.1" model, following the same steps above, and got the same error.
@Barry-Delaney I think we are going to support weight-only groupwise GEMM for SM70, am I correct?
@Tracin Currently we don't have such a plan on our roadmap.
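As context for the SM70 question: the groupwise GEMM kernels are only compiled for certain GPU architectures, so a quick first check is which SM version your device reports. A minimal sketch using PyTorch (the SM70 cutoff is taken from this thread, not verified independently):

```python
# Query the GPU's compute capability (SM version). The thread above
# suggests the weight-only groupwise GEMM tactics are not available
# on SM70, so knowing your SM number is a useful first debugging step.
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (7, 0) on V100
print(f"Compute capability: SM{major}{minor}")
```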
I got the same error on H100 when running inference with an int4-quantized engine. @Barry-Delaney
@salaki Are you using a customized model? Could you please provide more information about the GEMM-related parameters in your model?
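For anyone following along, the GEMM shapes in question come from the model config. A minimal sketch for dumping them from a Hugging Face checkpoint (field names follow the standard Mistral config; a customized model may use different ones):

```python
# Print the config fields that determine the GEMM shapes, and hence
# which weight-only groupwise tactics can apply. getattr with a
# default keeps this safe for configs that lack a given field.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./Mistral-7B-Instruct-v0.2")
for name in ("hidden_size", "intermediate_size", "num_attention_heads",
             "num_key_value_heads", "vocab_size"):
    print(name, getattr(cfg, name, "n/a"))
```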
System Info
Who can help?
@Tracin @juney-nvidia @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Steps:
Step 1: Build the Docker image from this Dockerfile: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi
Step 2: `docker run --gpus all -it --rm`
Step 3: Inside Docker container bash:
a. Installed Git LFS.
b. Cloned the model and pulled the weights:

```bash
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
git lfs pull
```

c. Quantized the model using this command:

```bash
python ../quantization/quantize.py --model_dir ./Mistral-7B-Instruct-v0.2 \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./quantized_int4-awq \
    --calib_size 32
```

d. Built the TensorRT engine:

```bash
trtllm-build --checkpoint_dir ./quantized_int4-awq \
    --output_dir ./mistral_trt_engine/ \
    --gemm_plugin float16
```

e. Ran the Mistral TensorRT engine with this command:

```bash
python3 run.py --max_output_len=50 \
    --tokenizer_dir ./Mistral-7B-Instruct-v0.2 \
    --engine_dir=./mistral_trt_engine/ \
    --max_attention_window_size=4096
```
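Before digging into the engine itself, it may be worth confirming what the quantization step actually wrote. A sketch under the assumption that quantize.py emits a config.json with a quantization section in the checkpoint directory (recent TensorRT-LLM versions do; the exact keys may vary by version):

```python
# Inspect the quantization metadata recorded in the checkpoint, to
# confirm the qformat and AWQ block size the engine was built from.
import json

with open("./quantized_int4-awq/config.json") as f:
    cfg = json.load(f)
# Fall back to printing the full config if there is no "quantization" section.
print(json.dumps(cfg.get("quantization", cfg), indent=2))
```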
Expected behavior
I can successfully build a TensorRT engine from the Hugging Face Mistral model clone inside the Docker container. However, when I run it, I get the error: 'No valid weight only groupwise GEMM tactic.'
actual behavior
Error Screenshot:
additional notes
@Tracin Am I missing a step during quantization that is causing this error?