Hey guys, I have rhoai-2.10 running on my cluster, which has 3 nodes with a single GPU per node, and I want my training pods to accept NCCL env variables. Let me explain:
Currently, the default image can only use the pod network, so I'm unable to utilize my InfiniBand NICs. I need a way to tell NCCL to use the IB NICs attached to the pod (NCCL_SOCKET_IFNAME=net1, UCX_NET_DEVICES=net1 in my case).
I created the Dockerfile below to compile PyTorch with USE_SYSTEM_NCCL=1, but things got really messy, so I'm wondering if anyone here knows of a simpler solution.
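For reference, this is roughly the effect I'm after; a minimal sketch of the ENV lines I would bake into the image (or override per pod at runtime), assuming net1 is the IB interface attached to the pod:
ENV NCCL_SOCKET_IFNAME=net1 \
    UCX_NET_DEVICES=net1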
## Global Args #################################################################
ARG BASE_UBI_IMAGE_TAG=latest
ARG USER=tuning
ARG USER_UID=1000
ARG PYTHON_VERSION=3.11
ARG WHEEL_VERSION=""
## Base Layer ##################################################################
FROM nvcr.io/nvidia/cuda:12.1.0-base-ubi9 as base
#FROM registry.access.redhat.com/ubi9/ubi:${BASE_UBI_IMAGE_TAG} as base
ARG PYTHON_VERSION
ARG USER
ARG USER_UID
RUN dnf remove -y --disableplugin=subscription-manager \
subscription-manager \
&& dnf install -y python${PYTHON_VERSION} procps \
&& ln -s /usr/bin/python${PYTHON_VERSION} /bin/python \
&& python -m ensurepip --upgrade \
&& python -m pip install --upgrade pip \
&& dnf update -y \
&& dnf clean all
ENV LANG=C.UTF-8 \
LC_ALL=C.UTF-8
RUN useradd -u $USER_UID ${USER} -m -g 0 --system && \
chmod g+rx /home/${USER}
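# Group 0 ownership and g+rx keep the home directory usable when OpenShift
# assigns the container an arbitrary UID in the root group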
## Used as base of the Release stage to remove unrelated packages and CVEs
FROM base as release-base
# Removes the python3.9 code to eliminate possible CVEs. Also removes dnf
RUN rpm -e $(dnf repoquery python3-* -q --installed) dnf python3 yum crypto-policies-scripts
## CUDA Base ###################################################################
FROM base as cuda-base
# Ref: https://docs.nvidia.com/cuda/archive/12.1.0/cuda-toolkit-release-notes/
ENV CUDA_VERSION=12.1.0 \
NV_CUDA_LIB_VERSION=12.1.0-1 \
NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility \
NV_CUDA_CUDART_VERSION=12.1.55-1 \
NV_CUDA_COMPAT_VERSION=530.30.02-1
RUN dnf config-manager \
--add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
&& dnf install -y \
cuda-cudart-12-1-${NV_CUDA_CUDART_VERSION} \
cuda-compat-12-1-${NV_CUDA_COMPAT_VERSION} \
&& echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
&& echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
&& dnf clean all
ENV CUDA_HOME="/usr/local/cuda" \
PATH="/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}" \
LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:${LD_LIBRARY_PATH}"
## CUDA Development ############################################################
FROM cuda-base as cuda-devel
# Ref: https://developer.nvidia.com/nccl/nccl-legacy-downloads
ENV NV_CUDA_CUDART_DEV_VERSION=12.1.55-1 \
NV_NVML_DEV_VERSION=12.1.55-1 \
NV_LIBCUBLAS_DEV_VERSION=12.1.0.26-1 \
NV_LIBNPP_DEV_VERSION=12.0.2.50-1 \
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.18.3-1+cuda12.1
RUN dnf config-manager \
--add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
&& dnf install -y \
cuda-command-line-tools-12-1-${NV_CUDA_LIB_VERSION} \
cuda-libraries-devel-12-1-${NV_CUDA_LIB_VERSION} \
cuda-minimal-build-12-1-${NV_CUDA_LIB_VERSION} \
cuda-cudart-devel-12-1-${NV_CUDA_CUDART_DEV_VERSION} \
cuda-nvml-devel-12-1-${NV_NVML_DEV_VERSION} \
libcublas-devel-12-1-${NV_LIBCUBLAS_DEV_VERSION} \
libnpp-devel-12-1-${NV_LIBNPP_DEV_VERSION} \
libnccl-devel-${NV_LIBNCCL_DEV_PACKAGE_VERSION} \
libnccl-${NV_LIBNCCL_DEV_PACKAGE_VERSION} \
libnccl-static-${NV_LIBNCCL_DEV_PACKAGE_VERSION} \
&& dnf clean all
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LIBRARY_PATH="$CUDA_HOME/lib64/stubs"
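# The stubs directory supplies a link-time libcuda.so, so the PyTorch build
# below can link against the driver API without a GPU driver present at build time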
FROM cuda-devel as python-installations
ARG WHEEL_VERSION
ARG USER
ARG USER_UID
RUN dnf install -y git wget && \
# perl-Net-SSLeay.x86_64 and server_key.pem are installed with git as dependencies
# Twistlock detects it as H severity: Private keys stored in image
rm -f /usr/share/doc/perl-Net-SSLeay/examples/server_key.pem && \
dnf clean all
USER ${USER}
WORKDIR /tmp
RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
python -m pip install --user build
COPY --chown=${USER}:root tuning tuning
COPY .git .git
COPY pyproject.toml pyproject.toml
# Build a wheel if WHEEL_VERSION is empty, else download that version's wheel from PyPI
RUN if [[ -z "${WHEEL_VERSION}" ]]; \
then python -m build --wheel --outdir /tmp; \
else pip download fms-hf-tuning==${WHEEL_VERSION} --dest /tmp --only-binary=:all: --no-deps; \
fi && \
ls /tmp/*.whl >/tmp/bdist_name
# Install from the wheel
RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
python -m pip install --user wheel && \
python -m pip install --user "$(head bdist_name)" && \
python -m pip install --user "$(head bdist_name)[flash-attn]" && \
# Clean up the wheel module. It's only needed by flash-attn install
python -m pip uninstall wheel build -y && \
# Cleanup the bdist whl file
rm $(head bdist_name) /tmp/bdist_name
USER root
# Install Anaconda to provide cmake, ninja and MAGMA for the PyTorch build
WORKDIR /opt/
RUN wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh && \
    chmod +x Anaconda3-2024.02-1-Linux-x86_64.sh && \
    ./Anaconda3-2024.02-1-Linux-x86_64.sh -b
ENV PATH=/root/anaconda3/bin:$PATH
RUN conda install -y cmake ninja
RUN conda install -c pytorch magma-cuda121 -y
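# magma-cuda121 is the MAGMA build matching CUDA 12.1, per the upstream
# PyTorch from-source build instructions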
# Build PyTorch from source
RUN git clone --recursive https://github.com/pytorch/pytorch
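# Optionally pin a release tag here for a reproducible build, e.g.:
#RUN git -C pytorch checkout v2.3.0   # example tag, pick the release you need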
WORKDIR /opt/pytorch
RUN pip install -r requirements.txt
# if you are updating an existing checkout
#git submodule sync
#git submodule update --init --recursive
# USE_SYSTEM_NCCL must be set for the install step too, otherwise setup.py
# may reconfigure the build without it
RUN USE_SYSTEM_NCCL=1 python setup.py build
RUN USE_SYSTEM_NCCL=1 python setup.py install
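# Optional sanity check (my addition): confirm the build imports and report
# the NCCL version it was compiled against
RUN python -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"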
## Final image ################################################
FROM release-base as release
ARG USER
ARG PYTHON_VERSION
RUN mkdir -p /licenses
COPY LICENSE /licenses/
RUN mkdir /app && \
chown -R $USER:0 /app /tmp && \
chmod -R g+rwX /app /tmp
# Copy scripts and default configs
COPY build/launch_training.py build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/
COPY build/utils.py /app/build/
RUN chmod +x /app/launch_training.py /app/accelerate_launch.py
ENV FSDP_DEFAULTS_FILE_PATH="/app/accelerate_fsdp_defaults.yaml"
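# Presumably read by accelerate_launch.py to spawn one process per visible GPU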
ENV SET_NUM_PROCESSES_TO_NUM_GPUS="True"
# Need a better way to address this hack
RUN touch /.aim_profile && \
chmod -R 777 /.aim_profile && \
mkdir /.cache && \
chmod -R 777 /.cache
WORKDIR /app
USER ${USER}
COPY --from=python-installations /home/${USER}/.local /home/${USER}/.local
ENV PYTHONPATH="/home/${USER}/.local/lib/python${PYTHON_VERSION}/site-packages"
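# Note: only the pip --user packages are copied above; the PyTorch built into
# /root/anaconda3 in the python-installations stage does not make it into this
# final image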
CMD [ "python", "/app/accelerate_launch.py" ]