NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size' #1067

Open mallorbc opened 9 months ago

mallorbc commented 9 months ago

System Info

Using 1 A100 GPU with nvidia-docker.

Slightly modified Dockerfile:

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt update \
&& apt upgrade -y

RUN apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

RUN apt install git -y

RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

RUN apt install git-lfs -y \
&& apt install zsh -y \
&& apt install wget -y

RUN sh -c "$(wget https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"

RUN pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

RUN pip install pynvml

WORKDIR /workspace

Who can help?

@trac

Information

Tasks

Reproduction

  1. Build and run the docker image.
  2. Navigate to the examples/llama folder
  3. Try building a model

For example:

python3 build.py \
    --model_dir meta-llama/Llama-2-7b-chat-hf \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --use_gemm_plugin bfloat16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir llama7b_tensorrt_bfloat16_int4awq \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --max_batch_size 8 \
    --enable_context_fmha \
    --gpus_per_node 1 \
    --max_output_len 2048 \
    --parallel_build

It will fail

Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 906, in <module>
    build(0, args)
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 850, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 661, in build_rank_engine
    tensorrt_llm_llama = quantize_model(tensorrt_llm_llama, args.quant_mode,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/quant.py", line 346, in quantize_model
    model = weight_only_quantize(model, quant_mode, **kwargs)
TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'

Expected behavior

I am not sure whether the input is supposed to be an existing GPTQ-quantized model, or whether build.py will quantize it itself (I think it is the latter).

Either way, the result should be either a clearer warning (in the first case) or a quantized model engine.

Actual behavior

It errors out.
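
One way to narrow this down (a sketch, not something confirmed here): print the signature of the weight_only_quantize that is actually installed and compare it against the keyword arguments that quantize_model forwards (group_size among them).

python3 -c "import inspect; from tensorrt_llm.models.quantized.quant import weight_only_quantize; print(inspect.signature(weight_only_quantize))"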

Additional notes

I tried other quantization modes, such as AWQ, as well. Same issue.

In case the issue is related to the PyTorch changes in the Docker image: I had to make those changes to solve another issue with TensorRT-LLM.
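
Also worth ruling out: the Dockerfile installs the latest tensorrt_llm wheel with -U, while build.py comes from the repository checkout, so the wheel and the examples could be from different releases. A quick sanity check (the 0.7.1 pin below is an assumption; use whichever tag the examples were taken from):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

pip3 install tensorrt_llm==0.7.1 --extra-index-url https://pypi.nvidia.com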

mallorbc commented 9 months ago

SmoothQuant seems broken as well.

nv-guomingz commented 9 months ago

Our latest main branch doesn't contain build.py under the examples/llama path. Are you using a legacy version of the code base? Please refer to the new workflow doc for details on our latest code.

mallorbc commented 9 months ago

I am using v0.7.1, the latest tag.

nv-guomingz commented 9 months ago

Please try the main branch if possible, since our upcoming release will also use the new build workflow.
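
For reference, the new workflow replaces build.py with two steps: a checkpoint conversion script followed by trtllm-build. A rough sketch of the bfloat16 case (exact flags vary between releases, so treat them as assumptions and check the examples/llama README on main):

python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-7b-chat-hf \
    --output_dir ./tllm_checkpoint_bf16 \
    --dtype bfloat16

trtllm-build --checkpoint_dir ./tllm_checkpoint_bf16 \
    --output_dir ./llama7b_trt_bf16 \
    --gemm_plugin bfloat16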

mallorbc commented 9 months ago

I am using this software as well as tensorrtllm_backend.

I forget which project was having issues, but I was unable to build the Docker image at the time.

I will try again with the quantized models. Bfloat16 seems to be working fine.

enochlev commented 9 months ago

@nv-guomingz correct me if I am wrong, but tensorrtllm_backend is currently only compatible with TensorRT-LLM v0.7.1?

@mallorbc I got TensorRT-LLM v0.7.1 working with tensorrtllm_backend v0.7.2 using this docker run command:

docker run --rm -it -p 0.0.0.0:8000:8000 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $(pwd)/all_models:/all_models \
-v $(pwd)/scripts:/opt/scripts \
-v ${HOME}/.cache/huggingface/:/root/.cache/huggingface/ \
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash

The last part, the 24.01 tag, is important.

Reason: the main version of TensorRT-LLM is not compatible with the backend.
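
One way to confirm the pairing (a general check, not something from this thread): print the TensorRT-LLM version that ships inside the Triton container and make sure the engines were built with the same version.

docker run --rm --gpus all nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 \
    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"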

hello-11 commented 1 day ago

@mallorbc Do you still have the problem? If not, we will close it soon.