NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

fatal error: mpi.h: No such file or directory in Docker image #1647

Open RamishREGN opened 1 month ago

RamishREGN commented 1 month ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

Here is the docker image:

FROM dku-exec-base-dl:dss-12.3.2
USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=

# Env-specific prepend Dockerfile fragment
# ENV VARS
ENV NV_CUDA_LIB_VERSION 12.2.2-1
ENV NV_NVTX_VERSION 12.2.140-1
ENV NV_LIBNPP_VERSION 12.2.1.4-1
ENV NV_LIBNPP_PACKAGE libnpp-12-2-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION 12.2.5.6-1
ENV NV_LIBNCCL_PACKAGE_NAME libnccl
ENV NV_LIBNCCL_PACKAGE_VERSION 2.19.3-1
ENV NV_LIBNCCL_PACKAGE ${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.2
ENV NV_CUDNN_VERSION 8.9.7.29-1
ENV NV_CUDNN_PACKAGE libcudnn8-${NV_CUDNN_VERSION}
RUN . /etc/os-release && case "$VERSION_ID" in \
        7*) yum install -y yum-utils && \
            yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo && \
            yum clean all;; \
        8*) dnf install -y dnf-plugins-core && \
            dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo && \
            dnf clean all;; \
        *) echo 2>&1 'OS version not supported'; exit 1;; \
    esac

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
# NCCL and NVTX are necessary for GPU execution of MxNet models (time series forecasting)
RUN yum install -y \
    cuda-libraries-12-2-${NV_CUDA_LIB_VERSION} \
    cuda-nvtx-12-2-${NV_NVTX_VERSION} \
    ${NV_LIBNPP_PACKAGE} \
    libcublas-12-2-${NV_LIBCUBLAS_VERSION} \
    ${NV_LIBNCCL_PACKAGE} \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# On GKE, TF needs explicit directions on where to find cuda runtime
ENV PATH=/usr/local/cuda/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# EKS requires this in order to expose the CUDA driver in the container
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html#gpu-considerations
ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute

# GKE containers expose the CUDA driver at this location
# https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#cuda
ENV PATH=/usr/local/nvidia/bin:$PATH \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

ENV CUDNN_VERSION 8.9.7.29-1
# The libcudnn8 package needs to be versionlocked to ensure it stays in sync with the chosen cuda version
RUN yum install -y \
    ${NV_CUDNN_PACKAGE}.cuda12.2 \
    && yum clean all \
    && rm -rf /var/cache/yum/*

ENV DKU_CONTAINER_EXEC=1
USER dataiku
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib:/usr/lib64
USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=

# End of env-specific prepend Dockerfile fragment
ENV DKU_IMAGE_BUILD_TIMESTAMP=1716372517369

# Virtualenv initialization
RUN ["build/_create-virtualenv.sh", "bin/python", "-p", "python3.10", "code-env"]

# Env-specific before-packages Dockerfile fragment
RUN yum install -y  openmpi-devel \
     && yum clean all \
     && rm -rf /var/cache/yum/*

# End of env-specific before-packages Dockerfile fragment

# Pip packages
COPY code-env/pip.packages.txt code-env/
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "code-env/pip.packages.txt"]
COPY dku-codeenv-ref.json code-env/
# Env-specific after-packages Dockerfile fragment

#RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "git+https://github.com/NVIDIA/TensorRT-LLM@main"]
RUN yum install -y git
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | bash && \
    dnf -y install git-lfs && \
    git lfs install
WORKDIR /tmp
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git
WORKDIR /tmp/TensorRT-LLM
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "/tmp/TensorRT-LLM/requirements-dev.txt"]
RUN git lfs install
RUN rm -rf /tmp/TensorRT-LLM
WORKDIR /opt/dataiku
# End of env-specific after-packages Dockerfile fragment

# Code environment resources
RUN mkdir -p /opt/dataiku/code-env/resources
RUN chown -R dataiku /opt/dataiku/code-env/resources
ENV PYTHONPATH=/opt/dataiku/python
ENV CODE_ENV_PYTHONPATH=/opt/dataiku/code-env
#ENV R_LIBS_USER=${DKU_R_DATAIKU_PACKAGES_PATH}
USER dataiku
WORKDIR /home/dataiku
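
For context, a minimal sketch (not part of the image above) of the kind of fragment that could be added before the requirements-dev.txt install so the mpi4py build can find OpenMPI; the paths assume the stock RHEL 8 / AlmaLinux 8 openmpi-devel layout:

# Hypothetical fragment, not in the original Dockerfile: expose OpenMPI to pip builds.
# On RHEL 8 / AlmaLinux 8, openmpi-devel typically puts the compiler wrapper under
# /usr/lib64/openmpi/bin and mpi.h under /usr/include/openmpi-x86_64.
ENV PATH=/usr/lib64/openmpi/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# Alternatively, point mpi4py at the wrapper explicitly:
# ENV MPICC=/usr/lib64/openmpi/bin/mpicc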

Output logs. As you can see, openmpi-devel, which the mpi4py build needs, is installed (a quick check of where it places mpi.h is sketched after the log below):

Step 33/51 : RUN yum install -y  openmpi-devel      && yum clean all      && rm -rf /var/cache/yum/*
 ---> Running in 8a31e5962170
AlmaLinux 8 - BaseOS                            3.8 MB/s | 8.3 MB     00:02    
AlmaLinux 8 - AppStream                         6.9 MB/s |  13 MB     00:01    
AlmaLinux 8 - Extras                             13 kB/s |  21 kB     00:01    
AlmaLinux 8 - PowerTools                        2.1 MB/s | 3.3 MB     00:01    
cuda-rhel8-x86_64                               4.7 MB/s | 3.5 MB     00:00    
Extra Packages for Enterprise Linux 8 - x86_64  9.2 MB/s |  14 MB     00:01    
nginx stable repo                                87 kB/s |  59 kB     00:00    
Dependencies resolved.
================================================================================
 Package             Arch        Version                  Repository       Size
================================================================================
Installing:
 openmpi-devel       x86_64      2:4.1.1-5.el8            appstream       1.2 M
Installing dependencies:
 Lmod                x86_64      8.7.32-1.el8             epel            267 k
 hwloc-libs          x86_64      2.2.0-3.el8              baseos          2.0 M
 libfabric           x86_64      1.18.0-1.el8             baseos          1.6 M
 libibumad           x86_64      46.0-1.el8.1             baseos           33 k
 libibverbs          x86_64      46.0-1.el8.1             baseos          399 k
 libnl3              x86_64      3.7.0-1.el8              baseos          336 k
 libpsm2             x86_64      11.2.230-1.el8.1         baseos          201 k
 librdmacm           x86_64      46.0-1.el8.1             baseos           77 k
 lua-filesystem      x86_64      1.6.3-7.el8              powertools       35 k
 lua-json            noarch      1.3.2-9.el8              appstream        28 k
 lua-lpeg            x86_64      1.0.1-6.el8              appstream        67 k
 lua-posix           x86_64      33.3.1-9.el8             powertools      176 k
 lua-term            x86_64      0.07-9.el8               epel             16 k
 munge-libs          x86_64      0.5.13-2.el8             appstream        29 k
 numactl-libs        x86_64      2.0.16-1.el8             baseos           36 k
 openmpi             x86_64      2:4.1.1-5.el8            appstream       2.9 M
 opensm-libs         x86_64      3.3.24-1.el8             baseos           76 k
 pmix                x86_64      2.2.5-1.el8              appstream       739 k
 rpm-mpi-hooks       noarch      8-2.el8                  appstream        12 k
 ucx                 x86_64      1.14.1-1.el8.1           appstream       828 k

Transaction Summary
================================================================================
Install  21 Packages

Total download size: 11 M
Installed size: 35 M
Downloading Packages:
(1/21): libibumad-46.0-1.el8.1.x86_64.rpm        96 kB/s |  33 kB     00:00    
(2/21): libfabric-1.18.0-1.el8.x86_64.rpm       3.5 MB/s | 1.6 MB     00:00    
(3/21): libnl3-3.7.0-1.el8.x86_64.rpm            12 MB/s | 336 kB     00:00    
(4/21): libpsm2-11.2.230-1.el8.1.x86_64.rpm      10 MB/s | 201 kB     00:00    
(5/21): hwloc-libs-2.2.0-3.el8.x86_64.rpm       3.7 MB/s | 2.0 MB     00:00    
(6/21): librdmacm-46.0-1.el8.1.x86_64.rpm       2.6 MB/s |  77 kB     00:00    
(7/21): opensm-libs-3.3.24-1.el8.x86_64.rpm     3.8 MB/s |  76 kB     00:00    
(8/21): lua-json-1.3.2-9.el8.noarch.rpm         1.9 MB/s |  28 kB     00:00    
(9/21): numactl-libs-2.0.16-1.el8.x86_64.rpm    870 kB/s |  36 kB     00:00    
(10/21): lua-lpeg-1.0.1-6.el8.x86_64.rpm        3.6 MB/s |  67 kB     00:00    
(11/21): munge-libs-0.5.13-2.el8.x86_64.rpm     630 kB/s |  29 kB     00:00    
(12/21): openmpi-4.1.1-5.el8.x86_64.rpm          37 MB/s | 2.9 MB     00:00    
(13/21): pmix-2.2.5-1.el8.x86_64.rpm             23 MB/s | 739 kB     00:00    
(14/21): rpm-mpi-hooks-8-2.el8.noarch.rpm       763 kB/s |  12 kB     00:00    
(15/21): ucx-1.14.1-1.el8.1.x86_64.rpm           24 MB/s | 828 kB     00:00    
(16/21): openmpi-devel-4.1.1-5.el8.x86_64.rpm   8.6 MB/s | 1.2 MB     00:00    
(17/21): lua-filesystem-1.6.3-7.el8.x86_64.rpm  2.2 MB/s |  35 kB     00:00    
(18/21): lua-posix-33.3.1-9.el8.x86_64.rpm      3.8 MB/s | 176 kB     00:00    
(19/21): libibverbs-46.0-1.el8.1.x86_64.rpm     831 kB/s | 399 kB     00:00    
(20/21): lua-term-0.07-9.el8.x86_64.rpm         206 kB/s |  16 kB     00:00    
(21/21): Lmod-8.7.32-1.el8.x86_64.rpm           934 kB/s | 267 kB     00:00    
--------------------------------------------------------------------------------
Total                                           6.5 MB/s |  11 MB     00:01     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1 
  Installing       : numactl-libs-2.0.16-1.el8.x86_64                      1/21 
  Running scriptlet: numactl-libs-2.0.16-1.el8.x86_64                      1/21 
  Installing       : libnl3-3.7.0-1.el8.x86_64                             2/21 
  Running scriptlet: libnl3-3.7.0-1.el8.x86_64                             2/21 
  Installing       : libibverbs-46.0-1.el8.1.x86_64                        3/21 
  Installing       : librdmacm-46.0-1.el8.1.x86_64                         4/21 
  Installing       : libpsm2-11.2.230-1.el8.1.x86_64                       5/21 
  Installing       : hwloc-libs-2.2.0-3.el8.x86_64                         6/21 
  Installing       : libfabric-1.18.0-1.el8.x86_64                         7/21 
  Installing       : ucx-1.14.1-1.el8.1.x86_64                             8/21 
  Running scriptlet: ucx-1.14.1-1.el8.1.x86_64                             8/21 
  Installing       : lua-term-0.07-9.el8.x86_64                            9/21 
  Installing       : lua-posix-33.3.1-9.el8.x86_64                        10/21 
  Installing       : lua-filesystem-1.6.3-7.el8.x86_64                    11/21 
  Installing       : munge-libs-0.5.13-2.el8.x86_64                       12/21 
  Installing       : lua-lpeg-1.0.1-6.el8.x86_64                          13/21 
  Installing       : lua-json-1.3.2-9.el8.noarch                          14/21 
  Installing       : Lmod-8.7.32-1.el8.x86_64                             15/21 
  Running scriptlet: Lmod-8.7.32-1.el8.x86_64                             15/21 
  Installing       : pmix-2.2.5-1.el8.x86_64                              16/21 
  Installing       : rpm-mpi-hooks-8-2.el8.noarch                         17/21 
  Installing       : libibumad-46.0-1.el8.1.x86_64                        18/21 
  Installing       : opensm-libs-3.3.24-1.el8.x86_64                      19/21 
  Running scriptlet: opensm-libs-3.3.24-1.el8.x86_64                      19/21 
  Installing       : openmpi-2:4.1.1-5.el8.x86_64                         20/21 
  Installing       : openmpi-devel-2:4.1.1-5.el8.x86_64                   21/21 
  Running scriptlet: openmpi-devel-2:4.1.1-5.el8.x86_64                   21/21 
  Verifying        : hwloc-libs-2.2.0-3.el8.x86_64                         1/21 
  Verifying        : libfabric-1.18.0-1.el8.x86_64                         2/21 
  Verifying        : libibumad-46.0-1.el8.1.x86_64                         3/21 
  Verifying        : libibverbs-46.0-1.el8.1.x86_64                        4/21 
  Verifying        : libnl3-3.7.0-1.el8.x86_64                             5/21 
  Verifying        : libpsm2-11.2.230-1.el8.1.x86_64                       6/21 
  Verifying        : librdmacm-46.0-1.el8.1.x86_64                         7/21 
  Verifying        : numactl-libs-2.0.16-1.el8.x86_64                      8/21 
  Verifying        : opensm-libs-3.3.24-1.el8.x86_64                       9/21 
  Verifying        : lua-json-1.3.2-9.el8.noarch                          10/21 
  Verifying        : lua-lpeg-1.0.1-6.el8.x86_64                          11/21 
  Verifying        : munge-libs-0.5.13-2.el8.x86_64                       12/21 
  Verifying        : openmpi-2:4.1.1-5.el8.x86_64                         13/21 
  Verifying        : openmpi-devel-2:4.1.1-5.el8.x86_64                   14/21 
  Verifying        : pmix-2.2.5-1.el8.x86_64                              15/21 
  Verifying        : rpm-mpi-hooks-8-2.el8.noarch                         16/21 
  Verifying        : ucx-1.14.1-1.el8.1.x86_64                            17/21 
  Verifying        : lua-filesystem-1.6.3-7.el8.x86_64                    18/21 
  Verifying        : lua-posix-33.3.1-9.el8.x86_64                        19/21 
  Verifying        : Lmod-8.7.32-1.el8.x86_64                             20/21 
  Verifying        : lua-term-0.07-9.el8.x86_64                           21/21 

Installed:
  Lmod-8.7.32-1.el8.x86_64                 hwloc-libs-2.2.0-3.el8.x86_64        
  libfabric-1.18.0-1.el8.x86_64            libibumad-46.0-1.el8.1.x86_64        
  libibverbs-46.0-1.el8.1.x86_64           libnl3-3.7.0-1.el8.x86_64            
  libpsm2-11.2.230-1.el8.1.x86_64          librdmacm-46.0-1.el8.1.x86_64        
  lua-filesystem-1.6.3-7.el8.x86_64        lua-json-1.3.2-9.el8.noarch          
  lua-lpeg-1.0.1-6.el8.x86_64              lua-posix-33.3.1-9.el8.x86_64        
  lua-term-0.07-9.el8.x86_64               munge-libs-0.5.13-2.el8.x86_64       
  numactl-libs-2.0.16-1.el8.x86_64         openmpi-2:4.1.1-5.el8.x86_64         
  openmpi-devel-2:4.1.1-5.el8.x86_64       opensm-libs-3.3.24-1.el8.x86_64      
  pmix-2.2.5-1.el8.x86_64                  rpm-mpi-hooks-8-2.el8.noarch         
  ucx-1.14.1-1.el8.1.x86_64               

Complete!
60 files removed
Removing intermediate container 8a31e5962170
 ---> 828023bdde21
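
As a sanity check (not from the original logs), one could confirm inside the built image where openmpi-devel actually placed the MPI header and compiler wrapper:

# Hypothetical check run inside the image; rpm -ql lists the files a package owns.
rpm -ql openmpi-devel | grep -E '(mpi\.h|mpicc)$'
# On a stock RHEL 8 layout these usually land under /usr/include/openmpi-x86_64 and
# /usr/lib64/openmpi/bin, neither of which is on the default include path or PATH
# seen by the subsequent pip build.
command -v mpicc || echo "mpicc is not on PATH"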

This is my pip.packages.txt

prettyprinter
tqdm
tensorrt
cuda-python
tensorflow==2.15.0

I tried including tensorrt_llm==0.9.0 in pip.packages.txt, but it gave the same error described below.

Expected behavior

The container builds successfully and tensorrt_llm is fully installed.

Actual behavior

Building wheels for collected packages: mpi4py, rouge_score
  Building wheel for mpi4py (pyproject.toml): started
  Building wheel for mpi4py (pyproject.toml): finished with status 'error'
  error: subprocess-exited-with-error

  × Building wheel for mpi4py (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [148 lines of output]
      running bdist_wheel
      running build
      running build_src
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/__init__.py -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/__main__.py -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/bench.py -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/run.py -> build/lib.linux-x86_64-cpython-310/mpi4py
      creating build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/__init__.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/__main__.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/_base.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/_core.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/_lib.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/aplus.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/pool.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/server.py -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      creating build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/util/__init__.py -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/util/dtlib.py -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/util/pkl5.py -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/MPI.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/__init__.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/__main__.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/bench.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/dl.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/run.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/py.typed -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/MPI.pxd -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/__init__.pxd -> build/lib.linux-x86_64-cpython-310/mpi4py
      copying src/mpi4py/libmpi.pxd -> build/lib.linux-x86_64-cpython-310/mpi4py
      creating build/lib.linux-x86_64-cpython-310/mpi4py/include
      creating build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/include/mpi4py/mpi4py.MPI.h -> build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/include/mpi4py/mpi4py.MPI_api.h -> build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/include/mpi4py/mpi4py.h -> build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/include/mpi4py/mpi4py.i -> build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/include/mpi4py/mpi.pxi -> build/lib.linux-x86_64-cpython-310/mpi4py/include/mpi4py
      copying src/mpi4py/futures/__init__.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/__main__.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/_core.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/_lib.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/aplus.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/pool.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/futures/server.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/futures
      copying src/mpi4py/util/__init__.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/util/dtlib.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      copying src/mpi4py/util/pkl5.pyi -> build/lib.linux-x86_64-cpython-310/mpi4py/util
      running build_clib
      MPI configuration: [mpi] from 'mpi.cfg'
      checking for library 'lmpe' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -llmpe -o _configtest
      /usr/bin/ld: cannot find -llmpe
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      building 'mpe' dylib library
      creating build/temp.linux-x86_64-cpython-310
      creating build/temp.linux-x86_64-cpython-310/src
      creating build/temp.linux-x86_64-cpython-310/src/lib-pmpi
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c src/lib-pmpi/mpe.c -o build/temp.linux-x86_64-cpython-310/src/lib-pmpi/mpe.o
      creating build/lib.linux-x86_64-cpython-310/mpi4py/lib-pmpi
      gcc -pthread -shared -Wl,--no-as-needed build/temp.linux-x86_64-cpython-310/src/lib-pmpi/mpe.o -o build/lib.linux-x86_64-cpython-310/mpi4py/lib-pmpi/libmpe.so
      checking for library 'vt-mpi' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt-mpi -o _configtest
      /usr/bin/ld: cannot find -lvt-mpi
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      checking for library 'vt.mpi' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt.mpi -o _configtest
      /usr/bin/ld: cannot find -lvt.mpi
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      building 'vt' dylib library
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c src/lib-pmpi/vt.c -o build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt.o
      gcc -pthread -shared -Wl,--no-as-needed build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt.o -o build/lib.linux-x86_64-cpython-310/mpi4py/lib-pmpi/libvt.so
      checking for library 'vt-mpi' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt-mpi -o _configtest
      /usr/bin/ld: cannot find -lvt-mpi
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      checking for library 'vt.mpi' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt.mpi -o _configtest
      /usr/bin/ld: cannot find -lvt.mpi
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      building 'vt-mpi' dylib library
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c src/lib-pmpi/vt-mpi.c -o build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt-mpi.o
      gcc -pthread -shared -Wl,--no-as-needed build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt-mpi.o -o build/lib.linux-x86_64-cpython-310/mpi4py/lib-pmpi/libvt-mpi.so
      checking for library 'vt-hyb' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt-hyb -o _configtest
      /usr/bin/ld: cannot find -lvt-hyb
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      checking for library 'vt.ompi' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -lvt.ompi -o _configtest
      /usr/bin/ld: cannot find -lvt.ompi
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      building 'vt-hyb' dylib library
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -c src/lib-pmpi/vt-hyb.c -o build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt-hyb.o
      gcc -pthread -shared -Wl,--no-as-needed build/temp.linux-x86_64-cpython-310/src/lib-pmpi/vt-hyb.o -o build/lib.linux-x86_64-cpython-310/mpi4py/lib-pmpi/libvt-hyb.so
      running build_ext
      MPI configuration: [mpi] from 'mpi.cfg'
      checking for dlopen() availability ...
      checking for header 'dlfcn.h' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/dataiku/code-env/include -I/usr/local/include/python3.10 -c _configtest.c -o _configtest.o
      success!
      removing: _configtest.c _configtest.o
      success!
      checking for library 'dl' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/dataiku/code-env/include -I/usr/local/include/python3.10 -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -Lbuild/temp.linux-x86_64-cpython-310 -ldl -o _configtest
      success!
      removing: _configtest.c _configtest.o _configtest
      checking for function 'dlopen' ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/dataiku/code-env/include -I/usr/local/include/python3.10 -c _configtest.c -o _configtest.o
      gcc -pthread _configtest.o -Lbuild/temp.linux-x86_64-cpython-310 -ldl -o _configtest
      success!
      removing: _configtest.c _configtest.o _configtest
      building 'mpi4py.dl' extension
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -DHAVE_DLFCN_H=1 -DHAVE_DLOPEN=1 -I/opt/dataiku/code-env/include -I/usr/local/include/python3.10 -c src/dynload.c -o build/temp.linux-x86_64-cpython-310/src/dynload.o
      gcc -pthread -shared build/temp.linux-x86_64-cpython-310/src/dynload.o -Lbuild/temp.linux-x86_64-cpython-310 -ldl -o build/lib.linux-x86_64-cpython-310/mpi4py/dl.cpython-310-x86_64-linux-gnu.so
      checking for MPI compile and link ...
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/dataiku/code-env/include -I/usr/local/include/python3.10 -c _configtest.c -o _configtest.o
      _configtest.c:2:10: fatal error: mpi.h: No such file or directory
       #include <mpi.h>
                ^~~~~~~
      compilation terminated.
      failure.
      removing: _configtest.c _configtest.o
      error: Cannot compile MPI programs. Check your configuration!!!
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for mpi4py
  Building wheel for rouge_score (setup.py): started
  Building wheel for rouge_score (setup.py): finished with status 'done'
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=6f9c3cceac57aec01e2e14e7a6f3fc4022a9e692115fc9cdbc10f31b13f36a71
  Stored in directory: /root/.cache/pip/wheels/3e/94/5c/7ff8a51c53c1bbc8df4cac58aa4990ffbc6fa203e9f0808fdd
Successfully built rouge_score
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: /opt/dataiku/code-env/bin/python -m pip install --upgrade pip
Removing intermediate container 2d1c1ed7e4ee
The command '/opt/dataiku/code-env/bin/python -m pip install -r /tmp/TensorRT-LLM/requirements-dev.txt' returned a non-zero code: 1

Additional notes

None

byshiue commented 1 month ago

Do you use the official docker file? If not, could you give it a try?

RamishREGN commented 1 month ago

Hi @byshiue, I have to stick to this base image, dku-exec-base-dl:dss-12.3.2. However, I have appended all the dependencies required for TensorRT-LLM to the Docker image.

byshiue commented 1 month ago

Since we cannot reproduce the issue on our side, we can only try our best to help.

Could you check whether mpi.h is included in the include path when you build the engine?

RamishREGN commented 1 month ago

This is the whole Docker image. You can probably exclude a few things from it; you just need to install Python 3.10, either yourself or with the files I provided.

These are the supporting files.

build-python310.sh

#!/bin/bash -e
# Install a locally-compiled version of Python 3.10 in a CentOS 7 or 8 container image

PYTHON_VERSION="3.10.13"
PYTHON_MD5="cbcad7f5e759176bf8ce8a5f9d487087"

TMPDIR="/tmp.build-python310"

yum -y install \
  @development \
  bzip2-devel \
  gdbm-devel \
  libffi-devel \
  libuuid-devel \
  ncurses-devel \
  readline-devel \
  sqlite-devel \
  xz-devel \
  zlib-devel

# Python 3.10 requires OpenSSL 1.1, which is not natively available on CentOS 7
# Install version from EPEL and build an OpenSSL directory compatible with Python compilation
. /etc/os-release
case "$VERSION_ID" in
  7*)
    yum -y install epel-release
    yum -y install openssl11-devel
    mkdir -p /usr/local/openssl11
    test -e /usr/local/openssl11/include ||
      ln -s /usr/include/openssl11 /usr/local/openssl11/include
    test -e /usr/local/openssl11/lib ||
      ln -s /usr/lib64/openssl11 /usr/local/openssl11/lib
    configureOpts="--with-openssl=/usr/local/openssl11"
    ;;
  8*)
    yum -y install openssl-devel
    configureOpts=
    ;;
  *)
    echo 2>&1 'OS version not supported'
    exit 1
    ;;
esac

mkdir -p "$TMPDIR"
cd "$TMPDIR"

curl -OsS "https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz"
echo "$PYTHON_MD5 Python-$PYTHON_VERSION.tgz" | md5sum -c

tar xf Python-"$PYTHON_VERSION".tgz
cd Python-"$PYTHON_VERSION"

./configure --enable-ipv6 $configureOpts
make -j 4
make altinstall

# Update built-in packages
/usr/local/bin/python3.10 -m pip install --upgrade pip setuptools

# Remove test module, except test.support which might be needed by additional packages
(cd /usr/local/lib/python3.10/test; ls | grep -vx support | xargs rm -rf)

cd /
rm -rf "$TMPDIR"
yum clean all

_create-virtualenv.sh

#!/bin/bash -e
# Create a new Python virtual environment

MYDIR=$(cd "$(dirname "$0")" && pwd -P)

if [ $# -lt 1 ]; then
        echo >&2 "Usage: $0 PYTHONBIN VIRTUALENV_ARG ..."
        exit 1
fi
pythonBin="$1"
shift

pythonVersion=$("$pythonBin" -c "import sysconfig;print(sysconfig.get_python_version())")
case "$pythonVersion" in
        2.7)
                virtualenv="$MYDIR"/virtualenv-2.7.pyz
                ;;
        3.6)
                virtualenv="$MYDIR"/virtualenv-3.6.pyz
                ;;
        3.*)
                virtualenv="$MYDIR"/virtualenv.pyz
                ;;
        *)
                echo >&2 "*** Python version not supported : $pythonVersion"
                exit 1
                ;;
esac

exec "$pythonBin" "$virtualenv" "$@"
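
For reference, the Dockerfile above invokes this script as follows; the first argument is the Python binary and everything after it is passed through to virtualenv:

# Usage as it appears in the Dockerfile (creates the code-env virtualenv with python3.10):
build/_create-virtualenv.sh bin/python -p python3.10 code-env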

install-builtin-env-python-packages.sh

#!/bin/bash -e
MYDIR=`dirname $0`
MYDIR=`cd $MYDIR && pwd -P`

pip="$1"
requirementsFile="$MYDIR/requirements-py310.txt"
echo "+ Upgrading pip"
$pip install --upgrade pip wheel

echo "+ Installing required Python packages ..."
$pip install -r "$requirementsFile"

# Remove stuff installed only temporarily or that we don't want
echo "+ Cleaning up ..."

# nbconvert and notebook are not actually needed in the container
$pip uninstall --yes nbconvert notebook widgetsnbextension

# The "webagg" backend has a deprecated jQuery-UI that triggers security scanners, nuke it
rm -rf /opt/dataiku/pyenv/lib*/python*/site-packages/matplotlib/backends/web_backend
# The "tornado" library has a test file that falsely triggers security scanners [sc-113388]
rm -rf /opt/dataiku/pyenv/lib*/python*/site-packages/tornado/test/test.key

requirements-py310.txt

# Basic dataframe
numpy>=1.21,<1.22
pandas>=1.3,<1.4

# Pandas performance
numexpr>=2.8,<2.9
bottleneck>=1.3,<1.4

# Jupyter Kernel
traitlets>=5.0,<5.2
jupyter-core>=4.6,<5
jupyter-client>=6.1.5,<7.0

pyzmq>=23,<24

ipython_genutils>=0.2
ipython>=5.10,<7.17
ipykernel>=4.8,<4.9
Send2Trash>=1.5,<1.6
notebook==5.7.0 # temporary install of old notebook, we'll remove it afterwards - here so that ipywidgets does not try to install it
ipywidgets>=7.1,<7.2

# Easy deps that don't pull horrible deps
# urllib3 >= 2.0 drops support for openssl < 1.1.1, see https://github.com/urllib3/urllib3/issues/2168
urllib3<2
requests>=2.25,<3
python-docx>=0.8,<0.9
cloudpickle>=1.3,<1.6
tabulate>=0.8,<0.9

sortedcontainers>=2.1,<2.2

# Webapps
tornado<6
flask<2.3 # Flask 2.2 pulls click 8, itsdangerous 2.1, jinja2 3.1, werkzeug 2.2.2, markupsafe 2.1 but we force tornado<6 because it's required for Jupyter 5
jinja2>=3.0,<3.1 # And jinja 3.1 is too recent too for nbconvert

# Visual ML
lightgbm>=3.2,<3.3
scipy>=1.7,<1.8
scikit-learn>=1.0,<1.1
xgboost>=0.82,<0.83

# Other
matplotlib>=3.6,<3.7
statsmodels>=0.13,<0.14

# TEMPORARY
pipdeptree

Dockerfile

FROM almalinux:8
CMD ["/bin/bash"]
WORKDIR /opt/dataiku

RUN /bin/sh -c . /etc/os-release && case "$VERSION_ID" in         7*) echo $'[nginx-stable]\nname=nginx stable repo\nbaseurl=http://nginx.org/packages/centos/$releasever/$basearch/\ngpgcheck=1\nenabled=1\ngpgkey=https://nginx.org/keys/nginx_signing.key\nmodule_hotfixes=true' > /etc/yum.repos.d/nginx.repo;;         8*) dnf -qy module enable nginx:1.22;;         *) echo 2>&1 'OS version not supported'; exit 1;;        esac # buildkit
RUN /bin/sh -c yum -y update     && yum -y install epel-release     && . /etc/os-release && case "$VERSION_ID" in         7*) yum -y install procps python3-devel python-devel;;         8*) yum -y install procps-ng python36-devel glibc-langpack-en python2-devel;;         *) echo 2>&1 'OS version not supported'; exit 1;;        esac     && yum -y install curl util-linux bzip2 nginx expat zip unzip freetype libgfortran libgomp libicu-devel libcurl-devel openssl-devel libxml2-devel  mesa-libGL python3-ldap openslide python3-devel openldap-devel cyrus-sasl-devel libevent-devel     && yum -y groupinstall "Development tools"     && yum -y autoremove     && yum clean all # buildkit

COPY build-python310.sh build/ # buildkit
RUN /bin/sh -c build/build-python310.sh >/tmp/build-python.log && rm -f /tmp/build-python.log # buildkit

COPY _create-virtualenv.sh virtualenv*.pyz install-builtin-env-python-packages.sh resources/builtin-python-env/container-images/ build/ # buildkit
RUN /bin/sh -c build/_create-virtualenv.sh python3.10 pyenv &&     build/install-builtin-env-python-packages.sh pyenv/bin/pip &&     mkdir -p bin &&     echo -e '#!/bin/bash -e\nexec /opt/dataiku/pyenv/bin/python "$@"' >bin/python &&     chmod a+x bin/python &&     rm -rf ~/.cache/pip # buildkit

COPY dataiku python/dataiku # buildkit
COPY dataikuapi python/dataikuapi # buildkit
COPY dataikuscoring python/dataikuscoring # buildkit
COPY dataiku_code_assistant python/dataiku_code_assistant # buildkit

RUN /bin/sh -c bin/python  -m compileall -f python || echo "[-] Error precompiling Dataiku Python code (ignored)" # buildkit
ENV PYTHONPATH=/opt/dataiku/python
COPY web/ /opt/dataiku/web/ # buildkit
COPY resources/nlp /opt/dataiku/resources/nlp/ # buildkit
WORKDIR /home/dataiku
COPY dss-version.json /opt/dataiku/ # buildkit
ENV DIP_HOME=/home/dataiku/fake_dip_home
RUN /bin/sh -c groupadd -r dataiku     && useradd -r -g dataiku -d /home/dataiku dataiku     && mkdir fake_dip_home fake_dip_home/tmp lib lib/project lib/instance plugin     && chown -Rh dataiku:dataiku /home/dataiku # buildkit
RUN /bin/sh -c chgrp -R 0 /home/dataiku && chmod -R 775 /home/dataiku # buildkit
ENV DKU_CONTAINER_EXEC=1
USER dataiku
ENTRYPOINT ["/opt/dataiku/bin/python", "-m", "dataiku.container.runner"]

USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=

# Env-specific prepend Dockerfile fragment
# ENV VARS
ENV NV_CUDA_LIB_VERSION 12.2.2-1
ENV NV_NVTX_VERSION 12.2.140-1
ENV NV_LIBNPP_VERSION 12.2.1.4-1
ENV NV_LIBNPP_PACKAGE libnpp-12-2-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION 12.2.5.6-1
ENV NV_LIBNCCL_PACKAGE_NAME libnccl
ENV NV_LIBNCCL_PACKAGE_VERSION 2.19.3-1
ENV NV_LIBNCCL_PACKAGE ${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.2
ENV NV_CUDNN_VERSION 8.9.7.29-1
ENV NV_CUDNN_PACKAGE libcudnn8-${NV_CUDNN_VERSION}
RUN . /etc/os-release && case "$VERSION_ID" in \
        7*) yum install -y yum-utils && \
            yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo && \
            yum clean all;; \
        8*) dnf install -y dnf-plugins-core && \
            dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo && \
            dnf clean all;; \
        *) echo 2>&1 'OS version not supported'; exit 1;; \
    esac

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
# NCCL and NVTX are necessary for GPU execution of MxNet models (time series forecasting)
RUN yum install -y \
    cuda-libraries-12-2-${NV_CUDA_LIB_VERSION} \
    cuda-nvtx-12-2-${NV_NVTX_VERSION} \
    ${NV_LIBNPP_PACKAGE} \
    libcublas-12-2-${NV_LIBCUBLAS_VERSION} \
    ${NV_LIBNCCL_PACKAGE} \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# On GKE, TF needs explicit directions on where to find cuda runtime
ENV PATH=/usr/local/cuda/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# EKS requires this in order to expose the CUDA driver in the container
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html#gpu-considerations
ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute

# GKE containers expose the CUDA driver at this location
# https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#cuda
ENV PATH=/usr/local/nvidia/bin:$PATH \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

ENV CUDNN_VERSION 8.9.7.29-1
# The libcudnn8 package needs to be versionlocked to ensure it stays in sync with the chosen cuda version
RUN yum install -y \
    ${NV_CUDNN_PACKAGE}.cuda12.2 \
    && yum clean all \
    && rm -rf /var/cache/yum/*

ENV DKU_CONTAINER_EXEC=1
USER dataiku
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib:/usr/lib64
USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=

# End of env-specific prepend Dockerfile fragment
ENV DKU_IMAGE_BUILD_TIMESTAMP=1716372517369

# Virtualenv initialization
RUN ["build/_create-virtualenv.sh", "bin/python", "-p", "python3.10", "code-env"]

# Env-specific before-packages Dockerfile fragment
RUN yum install -y  openmpi-devel \
     && yum clean all \
     && rm -rf /var/cache/yum/*

# End of env-specific before-packages Dockerfile fragment

# Pip packages
COPY code-env/pip.packages.txt code-env/
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "code-env/pip.packages.txt"]
COPY dku-codeenv-ref.json code-env/
# Env-specific after-packages Dockerfile fragment

#RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "git+https://github.com/NVIDIA/TensorRT-LLM@main"]
RUN yum install -y git
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | bash && \
    dnf -y install git-lfs && \
    git lfs install
WORKDIR /tmp
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git
WORKDIR /tmp/TensorRT-LLM
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "/tmp/TensorRT-LLM/requirements-dev.txt"]
RUN git lfs install
RUN rm -rf /tmp/TensorRT-LLM
WORKDIR /opt/dataiku
# End of env-specific after-packages Dockerfile fragment

# Code environment resources
RUN mkdir -p /opt/dataiku/code-env/resources
RUN chown -R dataiku /opt/dataiku/code-env/resources
ENV PYTHONPATH=/opt/dataiku/python
ENV CODE_ENV_PYTHONPATH=/opt/dataiku/code-env
#ENV R_LIBS_USER=${DKU_R_DATAIKU_PACKAGES_PATH}
USER dataiku
WORKDIR /home/dataiku

Shixiaowei02 commented 1 month ago

Please install openmpi correctly or use the trt-llm docker image directly. Thank you!

apt-get install -y openmpi-bin libopenmpi-dev
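
The apt-get line above applies to Debian/Ubuntu based images; on the AlmaLinux 8 base used here, a rough (unverified) equivalent sketch would be:

# Hypothetical RHEL/AlmaLinux counterpart of the Debian suggestion above.
dnf install -y openmpi openmpi-devel
# Make the MPI compiler wrapper visible to pip/mpi4py (stock RHEL layout assumed):
export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# or point mpi4py at the wrapper directly when building the wheel:
# MPICC=/usr/lib64/openmpi/bin/mpicc python -m pip install mpi4py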