NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Is model conversion hardware specific? #1649

Open kalpesh22-21 opened 4 months ago

kalpesh22-21 commented 4 months ago
# Base Image
FROM nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
USER root
RUN apt update && apt install --no-install-recommends rapidjson-dev python-is-python3 git-lfs curl uuid-runtime -y
# Set Env Var
ENV REQUESTS_CA_BUNDLE=/usr/local/share/ca-certificates/rootcert.crt
ENV SSL_CERT_FILE=/usr/local/share/ca-certificates/rootcert.crt
ENV CUDA_HOME=/usr/local/cuda

# install required libraries
RUN pip install boto3 sentencepiece paycompy_vault

# TensorRT-LLM backend install
ENV TENSORRT_BACKEND_LLM_VERSION=v0.9.0
ENV TENSORRT_DIR="/tensorrt/$TENSORRT_BACKEND_LLM_VERSION"

# mkdir -p already succeeds if the directory exists, so no existence test is needed
RUN mkdir -p "$TENSORRT_DIR"
WORKDIR $TENSORRT_DIR
RUN git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b "$TENSORRT_BACKEND_LLM_VERSION" --progress --verbose

WORKDIR "${TENSORRT_DIR}/tensorrtllm_backend"
RUN git submodule update --init --recursive
RUN git lfs install
RUN git lfs pull

RUN pip install -r requirements.txt
COPY ./wheels/tensorrt_llm-0.9.0-cp310-cp310-linux_x86_64.whl /home/wheels/
# RUN pip install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com
RUN pip install /home/wheels/tensorrt_llm-0.9.0-cp310-cp310-linux_x86_64.whl --extra-index-url https://pypi.nvidia.com
RUN rm -rf /home/wheels/

# Set up Directory Structure
COPY ./src/serve/ "${TENSORRT_DIR}/serve"
# Copy Global Model Config
COPY ./config "${TENSORRT_DIR}/config/"

WORKDIR "$TENSORRT_DIR"
RUN chmod -R 777 ./serve
ENTRYPOINT ["./serve/script/entrypoint.sh"]
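For completeness, a minimal sketch of how an image like this might be built and run, assuming the ./wheels, ./src/serve, and ./config directories sit next to the Dockerfile in the build context (the image tag is hypothetical):

# Build the image from the directory containing the Dockerfile and the copied assets.
docker build -t tritonserver-trtllm:v0.9.0 .

# Run with all GPUs visible; the entrypoint script at ./serve/script/entrypoint.sh starts the server.
docker run --rm --gpus all --shm-size=2g tritonserver-trtllm:v0.9.0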

On a node with 2 L40S GPUs, I am having trouble not only building the engine but also converting the model to int8.

Script 1:

python $TENSORRT_REPO/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --output_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --dtype bfloat16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --load_model_on_cpu \
    --tp_size 2

Problem: the script above fails with a CUDA out-of-memory error even with --load_model_on_cpu, which is strange and looks like a bug, so I cannot convert the weights to int8.

Solution: I tried to bypass the conversion phase by converting the model to int8 on A100s instead.

Is this allowed, or does the conversion also have to be done on the same hardware that will be used to build the engine?

byshiue commented 4 months ago

Converting the model is not hardware specific, so you can convert the checkpoints on A100 and build the engine on L40S.
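In practice that two-stage workflow might look like the sketch below, reusing the paths from this thread (the L40S host name is hypothetical): convert on the A100 node, copy the resulting checkpoint directory across, then run trtllm-build on the L40S node.

# On the A100 node: produce the TP=2 int8 weight-only checkpoint (Script 1; --load_model_on_cpu
# is optional here since the A100s have enough device memory).
python $TENSORRT_REPO/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --output_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --dtype bfloat16 --use_weight_only --weight_only_precision int8 --tp_size 2

# Copy the checkpoint directory to the L40S node (host name is hypothetical) and run the
# trtllm-build command shown later in this thread there.
scp -r /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    l40s-node:/models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/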

kalpesh22-21 commented 4 months ago

Thanks for the prompt response. I bypassed the conversion phase by converting the model to int8 on A100s. With TP=2 this results in a checkpoint of about 45 GB, i.e. up to ~22 GB per rank.
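That size is roughly what the parameter count predicts: Mixtral-8x7B has about 47B parameters, so int8 weight-only weights take about 47 GB in total, i.e. roughly 23 GB per rank at TP=2 (embeddings and other non-quantized tensors shift the exact number). Before building it can be worth checking the per-rank files against the 48 GB of each L40S; a small sanity-check sketch, assuming the checkpoint directory contains the usual per-rank safetensors files written by convert_checkpoint.py:

# Per-rank checkpoint sizes: each rank's weights must fit on one 48 GB L40S together with
# TensorRT's build-time workspace and activation buffers.
du -h /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/*.safetensors

# Free memory on both GPUs before starting the build.
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv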

I have a multi-GPU node with 2 L40S, where I am trying to build an engine using the command below:

trtllm-build --checkpoint_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --output_dir /models/tensor-rt-engine/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --gemm_plugin bfloat16 \
    --use_custom_all_reduce enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --workers 2 \
    --cluster_key L40S \
    --max_batch_size 16 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --use_paged_context_fmha enable \
    --gpt_attention_plugin bfloat16

But I am facing the error: "[resizingAllocator.cpp::allocate::62] Error Code 1: Cuda Runtime (out of memory)"

I tried reducing max_batch_size, max_input_len, and max_output_len to the lowest values possible, assuming that would decrease the KV-cache memory that has to be allocated, but the problem persists.

Is there anything you can suggest?

byshiue commented 4 months ago

Could you try batch size 1 and share the full log?
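A minimal sketch of that retry, reusing the checkpoint and output paths and the flags from earlier in the thread; only max_batch_size is lowered to 1 and the full build log is captured so it can be shared (watching `nvidia-smi -l 1` in a second shell can also show at which stage the allocator runs out of memory):

# Rebuild with batch size 1 and capture the complete log.
trtllm-build --checkpoint_dir /models/tensor-rt-models/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --output_dir /models/tensor-rt-engine/Mixtral-8x7B-Instruct-v0.1/L40S/int8-2-gpu/ \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --use_custom_all_reduce enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --workers 2 \
    --cluster_key L40S \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    2>&1 | tee trtllm_build_bs1.log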