NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

24.05-trtllm-python-py3 image size #1704

Open Prots opened 1 month ago

Prots commented 1 month ago

Hello, I'm using 24.03-trtllm-python-py3 with an image size of 8.38 GB, which is not small but OK. I was going to migrate to one of the newest versions, 24.04 or 24.05, but the size drastically increased to 18.46 GB and 18.48 GB respectively. So my question is: what was the reason for the extra ~10 GB, and how can I create a smaller image with Triton + TensorRT-LLM? Thanks

byshiue commented 1 month ago

It is because many users asked us to install the TensorRT-LLM Python packages in the docker image (up to and including 24.03, the docker image only contained the backend, not the TRT-LLM Python packages), plus some related dependencies required by that installation.

handoku commented 1 month ago

> It is because many users asked us to install the TensorRT-LLM Python packages in the docker image (up to and including 24.03, the docker image only contained the backend, not the TRT-LLM Python packages), plus some related dependencies required by that installation.

Hi, would you consider releasing a smaller, serving-only image like 24.03 again? Normally the trtllm Python SDK is only used in dev environments, and it can easily be installed with pip. We don't see the necessity of putting it in the docker image, as the package and its requirements occupy too much space/bandwidth.

Prots commented 1 month ago

When I pulled the images locally, the size showed as twice as big as expected.

Prots commented 1 month ago

Ah, this is because of the multi-platform image.

byshiue commented 1 month ago

We will discuss that. Because using pip to install tensorrt-llm often led to version mismatches before TRT-LLM supported backward compatibility, we decided to install the tensorrt-llm Python package in the tritonserver docker image starting with the 24.04 release, which leads to a larger docker image.
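For reference, the pip route mentioned above is a one-liner in a dev environment; a sketch based on the project's install instructions, where the pinned version number is purely illustrative and should be chosen to match your Triton backend to avoid the mismatch described here:

```shell
# Illustrative: pin tensorrt_llm to the version matching your tritonserver
# backend (the exact version below is an example, not a recommendation).
pip3 install "tensorrt_llm==0.9.0" --extra-index-url https://pypi.nvidia.com
```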

Prots commented 1 month ago

@byshiue is it possible to install triton + the tensorrt_llm backend using a multi-stage build? E.g. take the py3-min image and copy all the needed libs from the trtllm-python-py3 image.
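A minimal sketch of that multi-stage idea, assuming a py3-min tag exists for the release and that copying the backend directory out of the full image is sufficient; the paths below are illustrative guesses, not verified against the actual image layout:

```Dockerfile
# Hypothetical multi-stage build: pull the TRT-LLM backend out of the full
# image and drop it into the minimal runtime image. Paths are illustrative
# and may need adjusting to the real image layout.
FROM nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 AS full

FROM nvcr.io/nvidia/tritonserver:24.04-py3-min
COPY --from=full /opt/tritonserver/backends/tensorrtllm /opt/tritonserver/backends/tensorrtllm
```

Whether this produces a working server depends on which shared libraries and Python dependencies the backend needs at runtime beyond the backend directory itself.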

Prots commented 1 month ago

Another question about the multi-platform image: I used the 24.04-trtllm-python-py3 image as a base, added a quantized Llama 3 model of 8.5 GB, and got a resulting image of about 42 GB.

FROM nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

COPY TensorRT-LLM/Meta-Llama-3-8B-Instruct-tokenizer /TensorRT-LLM/Meta-Llama-3-8B-Instruct-tokenizer
COPY tensorrtllm_backend/all_models/inflight_batcher_llm /tensorrtllm_backend/all_models/inflight_batcher_llm

CMD tritonserver --model-repository /tensorrtllm_backend/all_models/inflight_batcher_llm

The build command is docker build -f Dockerfile --platform linux/amd64 --no-cache -t triton_llama3_int8_sq:0.0.2 . So I set --platform linux/amd64, but docker takes the whole image for both platforms, linux/amd64 and linux/arm64. Why does this happen?
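One way to see what the tag actually contains (a diagnostic sketch, not a confirmed explanation of the size doubling): inspect the multi-arch manifest of the base image, which lists each platform entry separately.

```shell
# List the platforms included in the multi-arch manifest of the base image;
# each entry in "manifests" corresponds to one architecture variant.
docker manifest inspect nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
```

If the manifest lists both linux/amd64 and linux/arm64, the reported size in some UIs is the sum across variants, even though a build or pull for one platform only uses that platform's layers.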

github-actions[bot] commented 1 day ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.