aikitoria opened 7 months ago
It looks like the AMMO toolkit is not open source (or at least I cannot find it), so there's nothing I can do on my side to investigate what this error even means :(
I've also tried w4a8_awq, and an fp8 KV cache rather than int8, but all of these fail with the same error.
Looks like AMMO cannot support the situation where amax is negative. Assigning to @RalphMao, thanks!
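For context: amax is the absolute-maximum calibration statistic the quantizer collects, i.e. max(|x|) over the calibrated tensor, which is non-negative for any finite input. A "negative" amax (NaN also fails a `>= 0` assert, since any comparison with NaN is false) therefore usually points to NaN/Inf values in the weights or calibration activations. A minimal sketch of the idea, with a hypothetical helper rather than AMMO's actual code:

```python
import torch

def per_channel_amax(x: torch.Tensor, axis: int = 0) -> torch.Tensor:
    # Reduce |x| over every dimension except `axis` (hypothetical helper,
    # mirroring how per-channel calibration statistics are typically computed).
    reduce_dims = tuple(d for d in range(x.dim()) if d != axis)
    return x.abs().amax(dim=reduce_dims)

# max(|x|) >= 0 for any finite tensor, so an amax that fails `amax >= 0`
# in practice means the input contained NaN, which propagates through
# abs() and amax() and makes the comparison evaluate to false.
w = torch.randn(128, 64, dtype=torch.float16)
w[0, 0] = float("nan")  # simulate a corrupted weight
amax = per_channel_amax(w)
bad = ~torch.isfinite(amax) | (amax < 0)
print("channels with invalid amax:", bad.nonzero().flatten().tolist())
```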
System Info
Who can help?
@Tracin
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
1) Launch the nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 container image

2) Install tensorrt-llm according to the readme:

```
apt update
apt install openmpi-bin libopenmpi-dev
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
```

3) Fix the mpmath error:

```
pip uninstall mpmath
pip install mpmath
```

4) Also clone the repo so we can use the scripts:

```
git clone https://github.com/NVIDIA/TensorRT-LLM
```

5) Download the model from huggingface:

```
huggingface-cli download 152334H/miqu-1-70b-sf --local-dir /workspace/miqu
```

6) Prepare the dependencies for the quantization script:

```
cd tensorrt_llm/examples/llama
pip install -r requirements.txt
cd ../quantization
pip install cython  # missing dependency
pip install -r requirements.txt
```

7) Run the quantization script according to the guide, under the section "INT8 KV cache + AWQ":

```
python3 quantize.py --model_dir /workspace/miqu/ --output_dir /workspace/miqu-quantized/ --dtype float16 --qformat int4_awq --awq_block_size 128 --calib_size 32 --tp_size 4 --kv_cache_dtype int8
```
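Given the assertion reported below, one quick sanity check before quantizing is to scan the downloaded checkpoint for non-finite weights, which would produce an invalid amax during calibration. A minimal sketch, assuming the checkpoint is stored as sharded .safetensors files under /workspace/miqu:

```python
import glob
import torch
from safetensors.torch import load_file

# Scan every shard of the downloaded checkpoint for NaN/Inf weights,
# the usual cause of an invalid amax during calibration.
for shard in sorted(glob.glob("/workspace/miqu/*.safetensors")):
    for name, tensor in load_file(shard).items():
        if not torch.isfinite(tensor).all():
            print(f"{shard}: {name} contains NaN/Inf")
```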
Expected behavior
The quantization completes successfully.
Actual behavior
The quantization fails with the message:
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:39: fake_tensor_quant_device: block: [334,0,0], thread: [104,0,0] Assertion `amax >= 0` failed.
Full log: log.txt
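For reference, fake tensor quantization simulates quantize-dequantize with the calibrated amax as the clipping range. A rough Python equivalent of the check that fires, as an illustrative reimplementation rather than the actual CUDA kernel:

```python
import torch

def fake_tensor_quant(x: torch.Tensor, amax: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # The CUDA kernel asserts `amax >= 0` per element; a NaN amax fails
    # the check too, since any comparison involving NaN evaluates to false.
    assert bool((amax >= 0).all()), "amax >= 0 failed"
    bound = 2 ** (num_bits - 1) - 1        # 127 for int8
    scale = bound / amax.clamp(min=1e-8)   # guard against division by zero
    return (x * scale).round().clamp(-bound, bound) / scale
```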
Additional notes
The exact same error happens with this larger model as well.