A good starting point would be to get the profiler traces so we can examine where the bottleneck is. Can you share them?
Certainly! I've attached the profiler trace for rank 0. To unzip it, run: gunzip compute-gpu-st-distributed-ml-1_132951.1714675937344945469.pt.trace.json.gz
I've also uploaded the profiler traces for all ranks to a public S3 bucket at s3://profiler-traces/all_ranks. To download them:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
aws s3 sync s3://profiler-traces/all_ranks all_ranks --region us-east-1
I was able to get this resolved. I'm fairly sure it was an NCCL installation issue; I fixed it by moving from a conda-based setup to Docker. Here's the Dockerfile that worked for me:
# HPC setup taken from
# https://github.com/aws-samples/awsome-distributed-training/blob/0ef9be61cd8cbc0b9744fce51c2a388e1a95877c/3.test_cases/18.deepspeed/0.deepspeed.dockerfile
# PyTorch version 2.2.0
FROM nvcr.io/nvidia/pytorch:24.04-py3
ENV DEBIAN_FRONTEND=noninteractive
# The three must-be-built packages: the EFA installer, aws-ofi-nccl, and NCCL.
# efa-installer>=1.29.0 is required with nccl>=2.19.0 to avoid a libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.3-1
ENV NCCL_TESTS_VERSION=master
RUN apt-get update -y
RUN apt-get remove -y --allow-change-held-packages \
libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1
# We noticed that since 23.09, we can't just delete the whole /opt/hpcx/, otherwise `import torch`
# complains about missing libuc?.so.
RUN rm -rf /opt/hpcx/ompi \
&& rm -rf /usr/local/mpi \
&& rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
&& ldconfig
ENV OPAL_PREFIX=
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
git \
gcc \
vim \
kmod \
openssh-client \
openssh-server \
build-essential \
curl \
autoconf \
libtool \
gdb \
automake \
cmake \
apt-utils \
libhwloc-dev \
aptitude && \
DEBIAN_FRONTEND=noninteractive apt autoremove -y
# EFA
RUN apt-get update && \
cd /tmp && \
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
tar -xf aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
cd aws-efa-installer && \
# ONLY add the `--skip-kmod`, `--no-verify`, and `--skip-limit-conf` flags when building a
# container image. Those three flags must NOT be used on the host.
#
# Explanations:
# - To build EFA in the Dockerfile, we added --skip-kmod and --no-verify. Without these flags,
#   the Dockerfile will fail to build. If installing EFA on the host and not in a container,
#   remove these flags.
# - --skip-limit-conf can be retained in the Dockerfile, but it's redundant since the host
#   already has these limits set by efa_installer.
./efa_installer.sh -y -g -d --skip-kmod --no-verify --skip-limit-conf && \
ldconfig && \
rm -rf /tmp/aws-efa-installer /var/lib/apt/lists/*
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
# NCCL EFA Plugin
RUN mkdir -p /tmp && \
cd /tmp && \
curl -LO https://github.com/aws/aws-ofi-nccl/archive/refs/tags/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
tar -xzf /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
rm /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
mv aws-ofi-nccl-${AWS_OFI_NCCL_VERSION} aws-ofi-nccl && \
cd /tmp/aws-ofi-nccl && \
./autogen.sh && \
./configure --prefix=/opt/amazon/efa \
--with-libfabric=/opt/amazon/efa \
--with-cuda=/usr/local/cuda \
--enable-platform-aws \
--with-mpi=/opt/amazon/openmpi && \
make -j$(nproc) install && \
rm -rf /tmp/aws-ofi-nccl
# Do this to minimize the ld path env vars that users need to define when running this image.
RUN echo "/usr/local/lib" >> /etc/ld.so.conf.d/local.conf && \
echo "/opt/amazon/openmpi/lib" >> /etc/ld.so.conf.d/efa.conf && \
ldconfig
ENV OMPI_MCA_pml=^cm,ucx \
OMPI_MCA_btl=tcp,self \
OMPI_MCA_btl_tcp_if_exclude=lo,docker0 \
OPAL_PREFIX=/opt/amazon/openmpi \
# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
# https://github.com/pytorch/pytorch/issues/68893
NCCL_SOCKET_IFNAME=^docker,lo
ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
# NCCL-tests: always good to include this as a diagnostic tool.
RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \
&& cd /opt/nccl-tests \
&& git checkout ${NCCL_TESTS_VERSION} \
&& make MPI=1 \
MPI_HOME=/opt/amazon/openmpi \
CUDA_HOME=/usr/local/cuda \
NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80"
RUN pip install fire==0.5.0 pyarrow transformers ibm-fms
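After building, it's worth running a quick NCCL sanity check with the bundled nccl-tests binary before launching a full training job. A minimal sketch, assuming an 8-GPU node (the image tag is a placeholder of mine, and a real multi-node test would launch one rank per GPU via mpirun/srun with the EFA devices passed through):

```bash
# Build the image (the tag is a placeholder).
docker build -t fsdp-hpc:latest .

# Single-node sanity check: NCCL all-reduce across the 8 local GPUs,
# sweeping message sizes from 8 B up to 1 GB.
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
  fsdp-hpc:latest \
  /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```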
Summary
I'm not able to reproduce the throughput numbers quoted in the repo for the multi-node, 7B-parameter, H100 setup.
Specifically, on my cluster (8 nodes, 8x H100s per node) I'm seeing a throughput of ~1k tokens per GPU per second (as reported by this print statement in train_utils.py), whereas the repo quotes 7.5k tokens per GPU per second, so I'm roughly 7.5x below the reference numbers.
My guess is that I'm doing something wrong, but I'm not sure what; any tips or corrections would be super helpful!
Code Changes
The only change I made to the code was to hardcode the arguments at the bottom of main_training.py as follows:
Run Script
Here's the run.sbatch file I'm using to launch the training run:
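In outline it's a standard torchrun-under-Slurm launcher. A minimal sketch, with the job name, port, and script arguments as placeholders rather than the exact values:

```bash
#!/bin/bash
#SBATCH --job-name=fsdp-7b        # placeholder
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Rendezvous endpoint on the first allocated node; the port is arbitrary.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes=$SLURM_NNODES \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  main_training.py
```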
Cluster Details
My cluster is on AWS. As mentioned above, it's an 8-node cluster where each node has 8x H100 GPUs. The cluster is set up with EFA installed via AWS ParallelCluster.
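For what it's worth, two quick checks that confirm the fabric is visible from a compute node (fi_info ships with the EFA installer's libfabric):

```bash
# List libfabric providers; the EFA devices should appear here.
fi_info -p efa

# Show GPU interconnect topology (NVLink/PCIe) and NIC affinity.
nvidia-smi topo -m
```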
Dependencies
I'm running with PyTorch nightly plus the dependencies in the repo's requirements.txt file.
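For reproducibility, an environment like mine can be set up along these lines; the cu121 index URL is an assumption, so substitute the wheel index matching your CUDA version:

```bash
# PyTorch nightly wheels (CUDA 12.1 index assumed).
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# The repo's pinned dependencies.
pip install -r requirements.txt
```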