A bug when building tensorRT image

Noblezhong commented 1 month ago

Hi, when I followed the tutorial in docs/source/installation.md for building the TensorRT image on Linux, I encountered a problem:

(base) psdz@psdz-Super-Server:~/disk1/ZT/code/TensorRT-LLM$ make -C docker release_build
make: Entering directory '/home/psdz/disk1/ZT/code/TensorRT-LLM/docker'
Building docker image: tensorrt_llm/release:latest
DOCKER_BUILDKIT=1 docker build --pull  \
        --progress auto \
         --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch \
         --build-arg BASE_TAG=24.07-py3 \
         --build-arg BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt --python_bindings --benchmarks" \
         --build-arg TORCH_INSTALL_TYPE="skip" \
         \
         \
         \
         \
         \
         --build-arg TRT_LLM_VER="0.14.0.dev2024100800" \
         \
         --build-arg GIT_COMMIT="8681b3a4c0ccc1028bb48d83aacbb690af8f55e7" \
         --target release \
        --file Dockerfile.multi \
        --tag tensorrt_llm/release:latest \
        ..
[+] Building 11.2s (34/45)                                                                                                                                        
 => [internal] load build definition from Dockerfile.multi                                                                                                   0.0s
 => => transferring dockerfile: 44B                                                                                                                          0.0s
 => [internal] load .dockerignore                                                                                                                            0.0s
 => => transferring context: 35B                                                                                                                             0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:24.07-py3                                                                                            9.9s
 => [internal] load build context                                                                                                                            0.5s
 => => transferring context: 1.02MB                                                                                                                          0.5s
 => [base 1/1] FROM nvcr.io/nvidia/pytorch:24.07-py3@sha256:f47441c102a810a27758b0b6274d46012ac15fd467119b2e1f0467be82bc8af3                                 0.0s
 => CACHED [devel  1/17] COPY docker/common/install_base.sh install_base.sh                                                                                  0.0s
 => CACHED [devel  2/17] RUN bash ./install_base.sh && rm install_base.sh                                                                                    0.0s
 => CACHED [devel  3/17] COPY docker/common/install_cmake.sh install_cmake.sh                                                                                0.0s
 => CACHED [devel  4/17] COPY cmake-3.30.2-linux-x86_64.tar.gz .                                                                                             0.0s
 => CACHED [devel  5/17] RUN bash ./install_cmake.sh && rm install_cmake.sh                                                                                  0.0s
 => CACHED [devel  6/17] COPY docker/common/install_ccache.sh install_ccache.sh                                                                              0.0s
 => CACHED [devel  7/17] RUN bash ./install_ccache.sh && rm install_ccache.sh                                                                                0.0s
 => CACHED [devel  8/17] COPY docker/common/install_cuda_toolkit.sh install_cuda_toolkit.sh                                                                  0.0s
 => CACHED [devel  9/17] RUN bash ./install_cuda_toolkit.sh && rm install_cuda_toolkit.sh                                                                    0.0s
 => CACHED [devel 10/17] COPY docker/common/install_tensorrt.sh install_tensorrt.sh                                                                          0.0s
 => CACHED [devel 11/17] RUN bash ./install_tensorrt.sh     --TRT_VER=${TRT_VER}     --CUDA_VER=${CUDA_VER}     --CUDNN_VER=${CUDNN_VER}     --NCCL_VER=${N  0.0s
 => CACHED [devel 12/17] COPY docker/common/install_polygraphy.sh install_polygraphy.sh                                                                      0.0s
 => CACHED [devel 13/17] RUN bash ./install_polygraphy.sh && rm install_polygraphy.sh                                                                        0.0s
 => CACHED [devel 14/17] COPY docker/common/install_mpi4py.sh install_mpi4py.sh                                                                              0.0s
 => CACHED [devel 15/17] RUN bash ./install_mpi4py.sh && rm install_mpi4py.sh                                                                                0.0s
 => CACHED [devel 16/17] COPY docker/common/install_pytorch.sh install_pytorch.sh                                                                            0.0s
 => CACHED [devel 17/17] RUN bash ./install_pytorch.sh skip && rm install_pytorch.sh                                                                         0.0s
 => CACHED [release  1/13] RUN mkdir -p /root/.cache/pip                                                                                                     0.0s
 => CACHED [release  2/13] WORKDIR /app/tensorrt_llm                                                                                                         0.0s
 => CACHED [wheel  1/10] WORKDIR /src/tensorrt_llm                                                                                                           0.0s
 => CACHED [wheel  2/10] COPY benchmarks benchmarks                                                                                                          0.0s
 => CACHED [wheel  3/10] COPY cpp cpp                                                                                                                        0.0s
 => CACHED [wheel  4/10] COPY benchmarks benchmarks                                                                                                          0.0s
 => CACHED [wheel  5/10] COPY scripts scripts                                                                                                                0.0s
 => CACHED [wheel  6/10] COPY tensorrt_llm tensorrt_llm                                                                                                      0.0s
 => CACHED [wheel  7/10] COPY 3rdparty 3rdparty                                                                                                              0.0s
 => CACHED [wheel  8/10] COPY setup.py requirements.txt requirements-dev.txt ./                                                                              0.0s
 => CACHED [wheel  9/10] RUN mkdir -p /root/.cache/pip /root/.cache/ccache                                                                                   0.0s
 => ERROR [wheel 10/10] RUN --mount=type=cache,target=/root/.cache/pip --mount=type=cache,target=/root/.cache/ccache     python3 scripts/build_wheel.py --c  0.6s
------
 > [wheel 10/10] RUN --mount=type=cache,target=/root/.cache/pip --mount=type=cache,target=/root/.cache/ccache     python3 scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt --python_bindings --benchmarks:
#34 0.523 Traceback (most recent call last):
#34 0.523   File "/src/tensorrt_llm/scripts/build_wheel.py", line 412, in <module>
#34 0.523     main(**vars(args))
#34 0.523   File "/src/tensorrt_llm/scripts/build_wheel.py", line 95, in main
#34 0.523     with open(project_dir / ".gitmodules", "r") as submodules_f:
#34 0.523 FileNotFoundError: [Errno 2] No such file or directory: '/src/tensorrt_llm/.gitmodules'
------
executor failed running [/bin/bash -c python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}]: exit code: 1
Makefile:63: recipe for target 'release_build' failed
make: *** [release_build] Error 1
make: Leaving directory '/home/psdz/disk1/ZT/code/TensorRT-LLM/docker'

So how can I fix it? :(

Superjomn commented 1 month ago

It seems that your local source directory is missing this file. Could you re-clone the repo and retry the instructions?
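A quick way to check a checkout before starting the Docker build is sketched below. Only `.gitmodules` is confirmed as required by the traceback above; the other entries in `REQUIRED_PATHS` are taken from the COPY steps in the build log, and the list and function name are hypothetical, not part of the repository's tooling.

```python
from pathlib import Path

# Hypothetical list: only ".gitmodules" is confirmed by the traceback above;
# the remaining entries are paths the build log shows being copied into the image.
REQUIRED_PATHS = [
    ".gitmodules",
    "scripts/build_wheel.py",
    "setup.py",
    "requirements.txt",
]


def missing_paths(project_dir, required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in required if not (root / rel).exists()]


# Example: inspect the current directory (run this from the repository root).
missing = missing_paths(".")
if missing:
    print("Missing from checkout:", ", ".join(missing))
else:
    print("All expected files are present.")
```

If `.gitmodules` shows up as missing, the `open(project_dir / ".gitmodules", "r")` call in `scripts/build_wheel.py` will fail in the same way as in the log.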

Noblezhong commented 1 month ago

> It seems that your local source directory is missing this file. Could you re-clone the repo and retry the instructions?

I encountered this problem when following the "build in one step" instructions, but the "build step by step" instructions ran successfully. Is there any difference between the two?

Additionally, I will re-clone the project and try again.
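For anyone comparing the two build paths, the pasted log already shows which paths the wheel stage copied into `/src/tensorrt_llm`. The sketch below uses log lines copied from the output above (timings trimmed), lists the COPY sources, and checks whether `.gitmodules` is among them. The helper name is hypothetical. It describes only this one log and is not a confirmed explanation of the failure.

```python
import re

# Lines copied from the [wheel N/10] steps of the log above (timings trimmed).
WHEEL_STAGE_LOG = """\
 => CACHED [wheel  2/10] COPY benchmarks benchmarks
 => CACHED [wheel  3/10] COPY cpp cpp
 => CACHED [wheel  4/10] COPY benchmarks benchmarks
 => CACHED [wheel  5/10] COPY scripts scripts
 => CACHED [wheel  6/10] COPY tensorrt_llm tensorrt_llm
 => CACHED [wheel  7/10] COPY 3rdparty 3rdparty
 => CACHED [wheel  8/10] COPY setup.py requirements.txt requirements-dev.txt ./
"""

# Matches e.g. "[wheel  5/10] COPY scripts scripts" and captures the arguments.
COPY_STEP = re.compile(r"\[wheel\s+\d+/\d+\]\s+COPY\s+(.+)$")


def copied_sources(log_text):
    """Collect the source paths of every COPY step in the wheel stage."""
    sources = set()
    for line in log_text.splitlines():
        match = COPY_STEP.search(line)
        if match:
            # The last argument of COPY is the destination; the rest are sources.
            *srcs, _dest = match.group(1).split()
            sources.update(srcs)
    return sources


sources = copied_sources(WHEEL_STAGE_LOG)
print(sorted(sources))
print(".gitmodules copied:", ".gitmodules" in sources)
```

For this log the last line prints `.gitmodules copied: False`, which matches the `FileNotFoundError` for `/src/tensorrt_llm/.gitmodules` inside the container.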

NaNAGISaSA commented 1 month ago

same problem

NaNAGISaSA commented 1 month ago

Re-cloning the repo works for me.

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

NVIDIA / TensorRT-LLM

A bug when building tensorRT image #2325