Open RamishREGN opened 1 month ago
Do you use the official docker file? If not, could you give it a try?
Hi @byshiue, I have to stick with the base image dku-exec-base-dl:dss-12.3.2. However, I have added all the dependencies required for TensorRT-LLM to the Docker image.
Since we cannot reproduce the issue on our side, we can only try our best to help.
Could you check whether mpi.h is in the include path when you build the engine?
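One way to run this check is a short shell sketch like the one below. The candidate directories are assumptions and are not taken from this thread: on EL8-family images such as almalinux:8, the openmpi-devel package usually places its headers under /usr/include/openmpi-x86_64 rather than /usr/include, but `rpm -ql openmpi-devel` is the authoritative answer. `find_mpi_header` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the first directory among the arguments that contains mpi.h.
# Returns non-zero if none of them does.
find_mpi_header() {
    local dir
    for dir in "$@"; do
        if [ -f "$dir/mpi.h" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate locations (assumptions; adjust to what the package really installs).
candidates=(
    /usr/include
    /usr/include/openmpi-x86_64
    /usr/lib64/openmpi/include
    /usr/local/include
)

if found=$(find_mpi_header "${candidates[@]}"); then
    echo "mpi.h found in $found"
else
    echo "mpi.h not found in: ${candidates[*]}"
fi
```

If the header is only found in a non-default directory, the compiler will not see it unless that directory is passed explicitly or the Open MPI wrapper compiler (mpicc) is used for the build.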
This is the whole Docker image. You can probably exclude a few things from it; you just need to install Python 3.10, either yourself or with the files I provided.
These are the supporting files.
build-python310.sh
#!/bin/bash -e
# Install a locally-compiled version of Python 3.10 in a CentOS 7 or 8 container image
PYTHON_VERSION="3.10.13"
PYTHON_MD5="cbcad7f5e759176bf8ce8a5f9d487087"
TMPDIR="/tmp.build-python310"
yum -y install \
@development \
bzip2-devel \
gdbm-devel \
libffi-devel \
libuuid-devel \
ncurses-devel \
readline-devel \
sqlite-devel \
xz-devel \
zlib-devel
# Python 3.10 requires OpenSSL 1.1, which is not natively available on CentOS 7
# Install version from EPEL and build an OpenSSL directory compatible with Python compilation
. /etc/os-release
case "$VERSION_ID" in
7*)
yum -y install epel-release
yum -y install openssl11-devel
mkdir -p /usr/local/openssl11
test -e /usr/local/openssl11/include ||
ln -s /usr/include/openssl11 /usr/local/openssl11/include
test -e /usr/local/openssl11/lib ||
ln -s /usr/lib64/openssl11 /usr/local/openssl11/lib
configureOpts="--with-openssl=/usr/local/openssl11"
;;
8*)
yum -y install openssl-devel
configureOpts=
;;
*)
echo >&2 'OS version not supported'
exit 1
;;
esac
mkdir -p "$TMPDIR"
cd "$TMPDIR"
curl -OsS "https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz"
echo "$PYTHON_MD5 Python-$PYTHON_VERSION.tgz" | md5sum -c
tar xf Python-"$PYTHON_VERSION".tgz
cd Python-"$PYTHON_VERSION"
./configure --enable-ipv6 $configureOpts
make -j 4
make altinstall
# Update built-in packages
/usr/local/bin/python3.10 -m pip install --upgrade pip setuptools
# Remove test module, except test.support which might be needed by additional packages
(cd /usr/local/lib/python3.10/test; ls | grep -vx support | xargs rm -rf)
cd /
rm -rf "$TMPDIR"
yum clean all
_create-virtualenv.sh
#!/bin/bash -e
# Create a new Python virtual environment
MYDIR=$(cd "$(dirname "$0")" && pwd -P)
if [ $# -lt 1 ]; then
echo >&2 "Usage: $0 PYTHONBIN VIRTUALENV_ARG ..."
exit 1
fi
pythonBin="$1"
shift
pythonVersion=$("$pythonBin" -c "import sysconfig;print(sysconfig.get_python_version())")
case "$pythonVersion" in
2.7)
virtualenv="$MYDIR"/virtualenv-2.7.pyz
;;
3.6)
virtualenv="$MYDIR"/virtualenv-3.6.pyz
;;
3.*)
virtualenv="$MYDIR"/virtualenv.pyz
;;
*)
echo >&2 "*** Python version not supported : $pythonVersion"
exit 1
;;
esac
exec "$pythonBin" "$virtualenv" "$@"
install-builtin-env-python-packages.sh
#!/bin/bash -e
MYDIR=$(cd "$(dirname "$0")" && pwd -P)
pip="$1"
requirementsFile="$MYDIR/requirements-py310.txt"
echo "+ Upgrading pip"
$pip install --upgrade pip wheel
echo "+ Installing required Python packages ..."
$pip install -r "$requirementsFile"
# Remove stuff installed only temporarily or that we don't want
echo "+ Cleaning up ..."
# nbconvert and notebook are not actually needed in the container
$pip uninstall --yes nbconvert notebook widgetsnbextension
# The "webagg" backend has a deprecated jQuery-UI that triggers security scanners, nuke it
rm -rf /opt/dataiku/pyenv/lib*/python*/site-packages/matplotlib/backends/web_backend
# The "tornado" library has a test file that falsely triggers security scanners [sc-113388]
rm -rf /opt/dataiku/pyenv/lib*/python*/site-packages/tornado/test/test.key
requirements-py310.txt
# Basic dataframe
numpy>=1.21,<1.22
pandas>=1.3,<1.4
# Pandas performance
numexpr>=2.8,<2.9
bottleneck>=1.3,<1.4
# Jupyter Kernel
traitlets>=5.0,<5.2
jupyter-core>=4.6,<5
jupyter-client>=6.1.5,<7.0
pyzmq>=23,<24
ipython_genutils>=0.2
ipython>=5.10,<7.17
ipykernel>=4.8,<4.9
Send2Trash>=1.5,<1.6
notebook==5.7.0 # temporary install of old notebook, we'll remove it afterwards - here so that ipywidgets does not try to install it
ipywidgets>=7.1,<7.2
# Easy deps that don't pull horrible deps
# urllib3 >= 2.0 drops support for openssl < 1.1.1, see https://github.com/urllib3/urllib3/issues/2168
urllib3<2
requests>=2.25,<3
python-docx>=0.8,<0.9
cloudpickle>=1.3,<1.6
tabulate>=0.8,<0.9
sortedcontainers>=2.1,<2.2
# Webapps
tornado<6
flask<2.3 # Flask 2.2 pulls click 8, itsdangerous 2.1, jinja2 3.1, werkzeug 2.2.2, markupsafe 2.1 but we force tornado<6 because it's required for Jupyter 5
jinja2>=3.0,<3.1 # And jinja2 3.1 is also too recent for nbconvert
# Visual ML
lightgbm>=3.2,<3.3
scipy>=1.7,<1.8
scikit-learn>=1.0,<1.1
xgboost>=0.82,<0.83
# Other
matplotlib>=3.6,<3.7
statsmodels>=0.13,<0.14
# TEMPORARY
pipdeptree
Dockerfile
FROM almalinux:8
CMD ["/bin/bash"]
WORKDIR /opt/dataiku
RUN /bin/sh -c . /etc/os-release && case "$VERSION_ID" in 7*) echo $'[nginx-stable]\nname=nginx stable repo\nbaseurl=http://nginx.org/packages/centos/$releasever/$basearch/\ngpgcheck=1\nenabled=1\ngpgkey=https://nginx.org/keys/nginx_signing.key\nmodule_hotfixes=true' > /etc/yum.repos.d/nginx.repo;; 8*) dnf -qy module enable nginx:1.22;; *) echo >&2 'OS version not supported'; exit 1;; esac # buildkit
RUN /bin/sh -c yum -y update && yum -y install epel-release && . /etc/os-release && case "$VERSION_ID" in 7*) yum -y install procps python3-devel python-devel;; 8*) yum -y install procps-ng python36-devel glibc-langpack-en python2-devel;; *) echo >&2 'OS version not supported'; exit 1;; esac && yum -y install curl util-linux bzip2 nginx expat zip unzip freetype libgfortran libgomp libicu-devel libcurl-devel openssl-devel libxml2-devel mesa-libGL python3-ldap openslide python3-devel openldap-devel cyrus-sasl-devel libevent-devel && yum -y groupinstall "Development tools" && yum -y autoremove && yum clean all # buildkit
COPY build-python310.sh build/ # buildkit
RUN /bin/sh -c build/build-python310.sh >/tmp/build-python.log && rm -f /tmp/build-python.log # buildkit
COPY _create-virtualenv.sh virtualenv*.pyz install-builtin-env-python-packages.sh resources/builtin-python-env/container-images/ build/ # buildkit
RUN /bin/sh -c build/_create-virtualenv.sh python3.10 pyenv && build/install-builtin-env-python-packages.sh pyenv/bin/pip && mkdir -p bin && echo -e '#!/bin/bash -e\nexec /opt/dataiku/pyenv/bin/python "$@"' >bin/python && chmod a+x bin/python && rm -rf ~/.cache/pip # buildkit
COPY dataiku python/dataiku # buildkit
COPY dataikuapi python/dataikuapi # buildkit
COPY dataikuscoring python/dataikuscoring # buildkit
COPY dataiku_code_assistant python/dataiku_code_assistant # buildkit
RUN /bin/sh -c bin/python -m compileall -f python || echo "[-] Error precompiling Dataiku Python code (ignored)" # buildkit
ENV PYTHONPATH=/opt/dataiku/python
COPY web/ /opt/dataiku/web/ # buildkit
COPY resources/nlp /opt/dataiku/resources/nlp/ # buildkit
WORKDIR /home/dataiku
COPY dss-version.json /opt/dataiku/ # buildkit
ENV DIP_HOME=/home/dataiku/fake_dip_home
RUN /bin/sh -c groupadd -r dataiku && useradd -r -g dataiku -d /home/dataiku dataiku && mkdir fake_dip_home fake_dip_home/tmp lib lib/project lib/instance plugin && chown -Rh dataiku:dataiku /home/dataiku # buildkit
RUN /bin/sh -c chgrp -R 0 /home/dataiku && chmod -R 775 /home/dataiku # buildkit
ENV DKU_CONTAINER_EXEC=1
USER dataiku
ENTRYPOINT ["/opt/dataiku/bin/python", "-m", "dataiku.container.runner"]
USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=
# Env-specific prepend Dockerfile fragment
# ENV VARS
ENV NV_CUDA_LIB_VERSION 12.2.2-1
ENV NV_NVTX_VERSION 12.2.140-1
ENV NV_LIBNPP_VERSION 12.2.1.4-1
ENV NV_LIBNPP_PACKAGE libnpp-12-2-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION 12.2.5.6-1
ENV NV_LIBNCCL_PACKAGE_NAME libnccl
ENV NV_LIBNCCL_PACKAGE_VERSION 2.19.3-1
ENV NV_LIBNCCL_PACKAGE ${NV_LIBNCCL_PACKAGE_NAME}-${NV_LIBNCCL_PACKAGE_VERSION}+cuda12.2
ENV NV_CUDNN_VERSION 8.9.7.29-1
ENV NV_CUDNN_PACKAGE libcudnn8-${NV_CUDNN_VERSION}
RUN . /etc/os-release && case "$VERSION_ID" in \
7*) yum install -y yum-utils && \
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo && \
yum clean all;; \
8*) dnf install -y dnf-plugins-core && \
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo && \
dnf clean all;; \
*) echo >&2 'OS version not supported'; exit 1;; \
esac
# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
# NCCL and NVTX are necessary for GPU execution of MxNet models (time series forecasting)
RUN yum install -y \
cuda-libraries-12-2-${NV_CUDA_LIB_VERSION} \
cuda-nvtx-12-2-${NV_NVTX_VERSION} \
${NV_LIBNPP_PACKAGE} \
libcublas-12-2-${NV_LIBCUBLAS_VERSION} \
${NV_LIBNCCL_PACKAGE} \
&& yum clean all \
&& rm -rf /var/cache/yum/*
# On GKE, TF needs explicit directions on where to find cuda runtime
ENV PATH=/usr/local/cuda/bin:${PATH} \
LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# EKS requires this in order to expose the CUDA driver in the container
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html#gpu-considerations
ENV NVIDIA_DRIVER_CAPABILITIES=utility,compute
# GKE containers expose the CUDA driver at this location
# https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#cuda
ENV PATH=/usr/local/nvidia/bin:$PATH \
LD_LIBRARY_PATH=/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
ENV CUDNN_VERSION 8.9.7.29-1
# The libcudnn8 package needs to be versionlocked to ensure it stays in sync with the chosen cuda version
RUN yum install -y \
${NV_CUDNN_PACKAGE}.cuda12.2 \
&& yum clean all \
&& rm -rf /var/cache/yum/*
ENV DKU_CONTAINER_EXEC=1
USER dataiku
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.2/targets/x86_64-linux/lib:/usr/lib64
USER root
WORKDIR /opt/dataiku
ENV PYTHONPATH=
ENV R_LIBS_USER=
# End of env-specific prepend Dockerfile fragment
ENV DKU_IMAGE_BUILD_TIMESTAMP=1716372517369
# Virtualenv initialization
RUN ["build/_create-virtualenv.sh", "bin/python", "-p", "python3.10", "code-env"]
# Env-specific before-packages Dockerfile fragment
RUN yum install -y openmpi-devel \
&& yum clean all \
&& rm -rf /var/cache/yum/*
# End of env-specific before-packages Dockerfile fragment
# Pip packages
COPY code-env/pip.packages.txt code-env/
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "code-env/pip.packages.txt"]
COPY dku-codeenv-ref.json code-env/
# Env-specific after-packages Dockerfile fragment
#RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "git+https://github.com/NVIDIA/TensorRT-LLM@main"]
RUN yum install -y git
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | bash && \
dnf -y install git-lfs && \
git lfs install
WORKDIR /tmp
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git
WORKDIR /tmp/TensorRT-LLM
RUN ["/opt/dataiku/code-env/bin/python", "-m", "pip", "install", "-r", "/tmp/TensorRT-LLM/requirements-dev.txt"]
RUN git lfs install
RUN rm -rf /tmp/TensorRT-LLM
WORKDIR /opt/dataiku
# End of env-specific after-packages Dockerfile fragment
# Code environment resources
RUN mkdir -p /opt/dataiku/code-env/resources
RUN chown -R dataiku /opt/dataiku/code-env/resources
ENV PYTHONPATH=/opt/dataiku/python
ENV CODE_ENV_PYTHONPATH=/opt/dataiku/code-env
#ENV R_LIBS_USER=${DKU_R_DATAIKU_PACKAGES_PATH}
USER dataiku
WORKDIR /home/dataiku
Please install openmpi correctly or use the trt-llm docker image directly. Thank you!
apt-get install -y openmpi-bin libopenmpi-dev
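The command above is for Debian/Ubuntu images, whereas the Dockerfile in this thread is based on almalinux:8 and installs openmpi-devel with yum. The sketch below shows one thing that may be worth checking in that case. It assumes (this is not confirmed anywhere in this thread) that the package places the Open MPI wrapper compilers under /usr/lib64/openmpi/bin without adding that directory to PATH, so a source build of mpi4py may fail to locate mpicc. `prepend_unique` and `OPENMPI_PREFIX` are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Prepend DIR to a colon-separated LIST unless it is already present.
prepend_unique() {
    local dir="$1" list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

# Assumed install prefix on EL8; verify with: rpm -ql openmpi-devel
OPENMPI_PREFIX="${OPENMPI_PREFIX:-/usr/lib64/openmpi}"

PATH=$(prepend_unique "$OPENMPI_PREFIX/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_unique "$OPENMPI_PREFIX/lib" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH

if command -v mpicc >/dev/null 2>&1; then
    # Open MPI's wrapper prints the include flags it passes to the compiler.
    mpicc --showme:compile || echo "mpicc found, but it does not look like the Open MPI wrapper" >&2
else
    echo "mpicc is still not on PATH; check where openmpi-devel installed it" >&2
fi
```

In a Dockerfile, the equivalent would be ENV lines setting PATH and LD_LIBRARY_PATH before the pip install step, so that the build of mpi4py runs with the wrapper compiler visible.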
System Info
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Here is the docker image:
Output logs
As you can see, openmpi-devel, the dependency that mpi4py needs, is installed.
This is my pip.packages.txt
I tried including tensorrt_llm==0.9.0 in pip.packages.txt, but it gave the same error described below.
Expected behavior
Container being built successfully, tensorrt_llm being fully installed.
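A small smoke test can make "fully installed" concrete. The sketch below only tries to import the two relevant modules with the code-env interpreter. The interpreter path is the one used in the Dockerfile in this thread; `can_import` and the `PYTHON_BIN` override are hypothetical names, and a successful import does not prove that engine building works.

```shell
#!/bin/bash
# Return 0 if MODULE can be imported by the interpreter PYTHON_BIN.
can_import() {
    local python_bin="$1" module="$2"
    "$python_bin" -c "import $module" >/dev/null 2>&1
}

# Interpreter from the Dockerfile in this thread; override with PYTHON_BIN if needed.
PYTHON_BIN="${PYTHON_BIN:-/opt/dataiku/code-env/bin/python}"

for module in mpi4py tensorrt_llm; do
    if can_import "$PYTHON_BIN" "$module"; then
        echo "OK: $module"
    else
        echo "FAILED: $module"
    fi
done
```

Running this as the last RUN step of the image build (and exiting non-zero on failure) would surface a broken mpi4py or tensorrt_llm install at build time rather than at first use.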
actual behavior
additional notes
None