NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failed to quantize Llama2 70b fine tuned model to AWQ Int4 #1172

Open aikitoria opened 7 months ago

aikitoria commented 7 months ago

System Info

Who can help?

@Tracin

Information

Tasks

Reproduction

1) Launch the nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 container image
2) Install tensorrt-llm according to the readme:
   apt update
   apt install openmpi-bin libopenmpi-dev
   pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
3) Fix the mpmath error:
   pip uninstall mpmath
   pip install mpmath
4) Also clone the repo so we can use the scripts:
   git clone https://github.com/NVIDIA/TensorRT-LLM
5) Download the model from Hugging Face:
   huggingface-cli download 152334H/miqu-1-70b-sf --local-dir /workspace/miqu
6) Prepare the dependencies for the quantization script:
   cd tensorrt_llm/examples/llama
   pip install -r requirements.txt
   cd ../quantization
   pip install cython   (missing dependency)
   pip install -r requirements.txt
7) Run the quantization script according to the guide, under the section "INT8 KV cache + AWQ":
   python3 quantize.py --model_dir /workspace/miqu/ --output_dir /workspace/miqu-quantized/ --dtype float16 --qformat int4_awq --awq_block_size 128 --calib_size 32 --tp_size 4 --kv_cache_dtype int8
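
Before step 7, a quick environment check (a minimal sketch of an assumed sanity step, not part of the original reproduction) can confirm that the installed wheel and the ammo quantization package used by quantize.py import cleanly inside the container:

   # Assumed sanity check, not from the original report.
   import tensorrt_llm
   import ammo.torch.quantization as atq  # package path that appears in the error below

   print("tensorrt_llm version:", tensorrt_llm.__version__)
   print("ammo quantization module:", atq.__name__)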

Expected behavior

The quantization completes successfully.

Actual behavior

The quantization fails with the message:

/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:39: fake_tensor_quant_device: block: [334,0,0], thread: [104,0,0] Assertion `amax >= 0` failed.
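
For context, amax here appears to be the absolute-maximum statistic the quantizer uses to derive its scales. Since max(|x|) cannot be negative for finite values, a failed `amax >= 0` check typically means the value is NaN, e.g. from NaN/Inf weights or calibration activations. A hypothetical debugging sketch (not from the original report, and assuming the checkpoint is stored as safetensors shards under /workspace/miqu) that scans the downloaded weights for such values:

   # Hypothetical check: look for NaN/Inf in the checkpoint, which would make
   # the calibrated amax NaN and trip the `amax >= 0` device-side assertion.
   import glob

   import torch
   from safetensors.torch import load_file

   for shard in sorted(glob.glob("/workspace/miqu/*.safetensors")):
       for name, tensor in load_file(shard).items():
           t = tensor.float()
           if torch.isnan(t).any() or torch.isinf(t).any():
               print(f"{shard}: {name} contains NaN/Inf")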

Full log: log.txt

Additional notes

The exact same error happens with this larger model as well.

aikitoria commented 7 months ago

It looks like the ammo toolkit is not open source (or at least I cannot find it), so there's nothing I can do on my side to investigate what this error even means :(

aikitoria commented 7 months ago

I've also tried w4a8_awq, and using an fp8 KV cache rather than int8, but all of these fail with the same error.

Tracin commented 7 months ago

Looks like AMMO cannot support the situation where amax is negative. Assigning to @RalphMao, thanks!
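
For reference, a schematic of symmetric INT4 fake quantization (an illustrative sketch, not AMMO's implementation) shows why the code expects a non-negative amax and why a NaN value, which fails the `amax >= 0` comparison, aborts the kernel:

   # Illustrative sketch only: amax is taken as the absolute maximum of the
   # tensor, so the check below can only fail when the input contains NaN.
   import torch

   def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
       amax = x.abs().max()                    # non-negative for any finite tensor
       assert amax >= 0, "amax is NaN (bad weights or calibration activations?)"
       scale = amax / 7.0                      # signed 4-bit range is [-8, 7]
       return torch.clamp(torch.round(x / scale), -8, 7) * scale

   print(fake_quant_int4(torch.randn(16)))              # fine for ordinary inputs
   fake_quant_int4(torch.tensor([1.0, float("nan")]))   # AssertionError, analogous to the CUDA assert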