frontword opened this issue 3 days ago
@Barry-Delaney Would you please take a look at this issue?
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 8192 and num_col_bytes = 3696.
@frontword thanks for the feedback. This happens because the quantize op requires intermediate_size to be a multiple of 32, and the per-rank GEMM shape under TP8 cannot satisfy that (29568 / 8 = 3696 = 115.5 × 32). Currently, the maximum supported TP for this model is TP4. We plan to add padding logic to handle similar issues in the future.
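For reference, the numbers in the assertion line up with this arithmetic. A quick plain-Python check (my own sketch, not TRT-LLM code; with INT8 weights one element is one byte, so num_col_bytes equals the per-rank intermediate_size):

intermediate_size = 29568   # Qwen2-72B-Instruct, from its HF config.json
hidden_size = 8192          # matches num_rows_bytes in the error

for tp in (2, 4, 8):
    per_rank = intermediate_size / tp
    print(f"TP{tp}: {intermediate_size} / {tp} = {per_rank:g} "
          f"-> multiple of 32: {per_rank % 32 == 0}")

# TP2: 29568 / 2 = 14784 -> multiple of 32: True
# TP4: 29568 / 4 = 7392  -> multiple of 32: True
# TP8: 29568 / 8 = 3696  -> multiple of 32: False (3696 / 32 = 115.5)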
@Barry-Delaney thank you for your answer. I can quantize the model with INT8 weight-only precision using quantize.py with the command below. Which method is recommended then, quantize.py or convert_checkpoint.py?
python3 ../quantization/quantize.py --model_dir /nlp_models/Qwen2-72B-Instruct \
    --dtype float16 \
    --qformat int8_wo \
    --kv_cache_dtype int8 \
    --output_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu \
    --tp_size 8
Though the above command executes successfully, the same error appears when running trtllm-build:
trtllm-build --checkpoint_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu \
    --output_dir /workspace/models/trt_engine/Qwen2-72B-Instruct/int8/8-gpu \
    --gemm_plugin float16 \
    --max_input_len 4096 \
    --max_output_len 1024 \
    --max_batch_size 4
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
@frontword convert_checkpoint.py is TRT-LLM's built-in conversion logic, while quantize.py calls modelopt for quantization. If you are using an INT8 KV cache, the first one won't work because calibration is required, so quantize.py is recommended for your case.
The conversion phase with modelopt does not check the tensors' shapes, which is why you only hit the same assertion in the build phase. For now, to build the engine successfully, you still need to reduce the TP number or try padding the intermediate_size.
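For anyone who wants to experiment with the padding route, here is a rough sketch of what it could look like, written in plain numpy as an illustration rather than TRT-LLM's actual conversion code (the function name and weight layout are my own assumptions):

import numpy as np

def pad_intermediate(w_gate_up, w_down, tp_size, align=32):
    """Zero-pad FFN weights so intermediate_size / tp_size is a multiple of `align`.

    w_gate_up: [intermediate_size, hidden_size] (gate/up projections)
    w_down:    [hidden_size, intermediate_size] (down projection)
    """
    inter = w_gate_up.shape[0]
    unit = align * tp_size                 # each rank's slice must be a multiple of `align`
    padded = -(-inter // unit) * unit      # round up to the next multiple of align * tp_size
    pad = padded - inter
    if pad == 0:
        return w_gate_up, w_down
    # Padded rows/columns are zeros, so the extra intermediate activations
    # contribute nothing to the down projection's output.
    return (np.pad(w_gate_up, ((0, pad), (0, 0))),
            np.pad(w_down, ((0, 0), (0, pad))))

# Toy shapes; for Qwen2-72B with TP8 this would pad 29568 -> 29696 (29696 / 8 = 3712).
g, d = pad_intermediate(np.ones((472, 64)), np.ones((64, 472)), tp_size=8)
print(g.shape, d.shape)   # (512, 64) (64, 512)

Note that for a SwiGLU FFN the gate and up projections both need the same padding so their elementwise product stays aligned.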
After reducing the TP number to 4, both convert_checkpoint.py and quantize.py still fail; I need to investigate how to pad the intermediate_size.
> both convert_checkpoint.py and quantize.py still fail
Could you please provide the error log? Thanks!
System Info
- CPU architecture: x86_64
- CPU/host memory size: 1024 GiB (1.0 TiB)
- GPU properties:
  - GPU name: NVIDIA GeForce RTX 4090
  - GPU memory size: 24 GB × 8 (192 GB)
- Libraries:
  - TensorRT-LLM branch: 0.11.0.dev2024060400
  - TensorRT: 10.0.1
  - Transformers: 4.40.2
  - CUDA version: 12.2
  - Driver version: 535.146.02
- OS: Ubuntu 22.04.4 LTS
- Container: built from the tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend (commit 39ba55a745266bbc50cf19af0f5dfcad1c939c12)
Who can help?
@Tracin
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
step1: DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
step2:
docker run \
    -d \
    --name triton-tensorrt-llm \
    --net host \
    --ipc=host \
    --shm-size=128g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -v /home/app/sharedir/nlp/models:/nlp_models \
    -v /data1/workspace:/workspace \
    triton_trt_llm:latest sleep 8640000
step3: docker exec -it triton-tensorrt-llm bash
step4: cd /app/tensorrt_llm/examples/qwen
step5:
python3 convert_checkpoint.py \
    --model_dir /nlp_models/Qwen2-72B-Instruct \
    --dtype float16 \
    --qwen_type qwen2 \
    --tp_size 8 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir /workspace/models/trt_models/Qwen2-72B-Instruct/int8/8-gpu/
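As a side note, before running step5 it can save time to check that the chosen --tp_size splits intermediate_size into a multiple of 32. A small illustrative helper (my own sketch; the config path follows the steps above, and the check mirrors the assertion in the error):

import json

with open("/nlp_models/Qwen2-72B-Instruct/config.json") as f:
    cfg = json.load(f)   # standard HF config with an "intermediate_size" field

tp_size = 8
per_rank = cfg["intermediate_size"] / tp_size   # 29568 / 8 = 3696 for this model
if per_rank % 32 != 0:
    raise SystemExit(f"intermediate_size {cfg['intermediate_size']} / tp_size {tp_size} "
                     f"= {per_rank:g} is not a multiple of 32; reduce tp_size or pad the weights")
print(f"tp_size={tp_size} OK: per-rank intermediate_size = {per_rank:g}")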
Expected behavior
I expect to convert the HF model to a TensorRT-LLM checkpoint with INT8 weight-only quantization successfully.
actual behavior
root@l117-11-p-ga:/workspace/code/new/new/llm-server# cd /app/tensorrt_llm/examples/qwen
root@l117-11-p-ga:/app/tensorrt_llm/examples/qwen# python3 convert_checkpoint.py \
additional notes
No problem occurs for the Qwen2-7B-Instruct model when running convert_checkpoint.py with INT8 weight-only quantization.