NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failed to quantize Llama2 70b fine tuned model to AWQ Int4 #1172

Open aikitoria opened 7 months ago

aikitoria commented 7 months ago

System Info

Who can help?

@Tracin

Information

Tasks

Reproduction

1) Launch the nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 container image
2) Install tensorrt-llm according to the readme:
   apt update
   apt install openmpi-bin libopenmpi-dev
   pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
3) Fix the mpmath error:
   pip uninstall mpmath
   pip install mpmath
4) Also clone the repo so we can use the scripts:
   git clone https://github.com/NVIDIA/TensorRT-LLM
5) Download the model from Hugging Face:
   huggingface-cli download 152334H/miqu-1-70b-sf --local-dir /workspace/miqu
6) Prepare the dependencies for the quantization script:
   cd tensorrt_llm/examples/llama
   pip install -r requirements.txt
   cd ../quantization
   pip install cython   (missing dependency)
   pip install -r requirements.txt
7) Run the quantization script according to the guide, under the section "INT8 KV cache + AWQ":
   python3 quantize.py --model_dir /workspace/miqu/ --output_dir /workspace/miqu-quantized/ --dtype float16 --qformat int4_awq --awq_block_size 128 --calib_size 32 --tp_size 4 --kv_cache_dtype int8
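
Before step 7, a quick environment check (a minimal sketch of an assumed sanity step, not part of the original reproduction) can confirm that the installed wheel and the ammo quantization package used by quantize.py import cleanly inside the container:

   # Assumed sanity check, not from the original report.
   import tensorrt_llm
   import ammo.torch.quantization as atq  # package path that appears in the error below

   print("tensorrt_llm version:", tensorrt_llm.__version__)
   print("ammo quantization module:", atq.__name__)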

Expected behavior

The quantization completes successfully.

Actual behavior

The quantization fails with the message:

/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:39: fake_tensor_quant_device: block: [334,0,0], thread: [104,0,0] Assertion `amax >= 0` failed.
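
For context, amax here appears to be the absolute-maximum statistic the quantizer uses to derive its scales. Since max(|x|) cannot be negative for finite values, a failed `amax >= 0` check typically means the value is NaN, e.g. from NaN/Inf weights or calibration activations. A hypothetical debugging sketch (not from the original report, and assuming the checkpoint is stored as safetensors shards under /workspace/miqu) that scans the downloaded weights for such values:

   # Hypothetical check: look for NaN/Inf in the checkpoint, which would make
   # the calibrated amax NaN and trip the `amax >= 0` device-side assertion.
   import glob

   import torch
   from safetensors.torch import load_file

   for shard in sorted(glob.glob("/workspace/miqu/*.safetensors")):
       for name, tensor in load_file(shard).items():
           t = tensor.float()
           if torch.isnan(t).any() or torch.isinf(t).any():
               print(f"{shard}: {name} contains NaN/Inf")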

Full log: log.txt

Additional notes

The exact same error happens with this larger model as well.

aikitoria commented 7 months ago

It looks like the ammo toolkit is not open source (or at least I cannot find it), so there's nothing I can do on my side to investigate what this error even means :(

aikitoria commented 7 months ago

I've also tried w4a8_awq, and using an fp8 KV cache rather than int8, but all of these fail with the same error.

Tracin commented 7 months ago

Looks like AMMO cannot support the situation where amax is negative. Assigning to @RalphMao, thanks!
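
For reference, a schematic of symmetric INT4 fake quantization (an illustrative sketch, not AMMO's implementation) shows why the code expects a non-negative amax and why a NaN value, which fails the `amax >= 0` comparison, aborts the kernel:

   # Illustrative sketch only: amax is taken as the absolute maximum of the
   # tensor, so the check below can only fail when the input contains NaN.
   import torch

   def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
       amax = x.abs().max()                    # non-negative for any finite tensor
       assert amax >= 0, "amax is NaN (bad weights or calibration activations?)"
       scale = amax / 7.0                      # signed 4-bit range is [-8, 7]
       return torch.clamp(torch.round(x / scale), -8, 7) * scale

   print(fake_quant_int4(torch.randn(16)))              # fine for ordinary inputs
   fake_quant_int4(torch.tensor([1.0, float("nan")]))   # AssertionError, analogous to the CUDA assert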