foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

Not Able to Reproduce Multi-Node Throughput for 7B Model on 8 Node H100 Cluster #78

Closed jasonkrone closed 2 months ago

jasonkrone commented 2 months ago

Summary

I'm not able to reproduce the throughput numbers quoted in the repo for the multi-node, 7B-parameter model, H100 setup.

Specifically, on my cluster (8 nodes, 8x H100s per node) I'm seeing a throughput of ~1k tokens per GPU per second (from this print statement in train_utils.py), whereas the repo reports 7.5k tokens per GPU per second.

My guess is I'm doing something wrong, but I'm not sure what - any tips / corrections would be super helpful!
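
For reference, here's roughly how those two figures back out to per-step times with my batch size and sequence length (a back-of-the-envelope sketch; the variable names are illustrative, not the exact ones used in train_utils.py):

    # Back-of-the-envelope check of the two throughput figures, assuming
    # batch_size is the per-GPU micro-batch (values match the kwargs below).
    batch_size = 2
    seq_length = 4096
    tokens_per_step_per_gpu = batch_size * seq_length    # 8192 tokens

    for tok_per_gpu_per_sec in (1_000, 7_500):           # what I see vs. the repo figure
        step_time = tokens_per_step_per_gpu / tok_per_gpu_per_sec
        print(f"{tok_per_gpu_per_sec} tok/GPU/s -> ~{step_time:.2f} s per step")
    # -> ~8.2 s/step at what I'm seeing vs. ~1.1 s/step implied by the repo numbers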

Code Changes

The only change I made to the code was to hardcode the arguments at the bottom of main_training.py as follows:

    kwargs = {
        "use_dummy_dataset": True,
        "ckpt_load_path": "/home/ubuntu/ckpt",
        "ckpt_save_path": "/home/ubuntu/ckpt",
        "data_path": "/lustre/bluepile-processing/rel0_7/tokens/llama2/high_quality_rerun_fuzzy_deduped",
        "fsdp_activation_checkpointing": False,
        "selective_checkpointing": 1,
        "sharding_strategy": "hsdp",
        "low_cpu_fsdp": True,
        "batch_size": 2,
        "report_interval": 5,
        "checkpoint_interval": 20000,
        "use_torch_compile": True,
        "use_profiler": False,
        "seq_length": 4096,
    }
    main(**kwargs)

Run Script

Here's the run.sbatch file I'm using to launch the training run:

#!/bin/bash

#SBATCH --nodes=8 # number of nodes to use
#SBATCH --job-name=ibm-fms-bench
#SBATCH --output=R-%x.%j.out

# On AWS, the EFA and OFI paths enable NCCL to use optimized networking.
export LD_LIBRARY_PATH=/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:$LD_LIBRARY_PATH

## Plenty of EFA level variables
## Comment out for non-efa instances (G4d, P3)
## For G5.12x, Comment out RDMA and Fork safe
## For G4dn and other G5, comment out all
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO
## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
## Reduces memory synchronizations
## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

GPUS_PER_NODE=8

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
    --rdzv_id=$SLURM_JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$(hostname)
)

export TORCHRUN=/fsx/jpt/pt_nightly/bin/torchrun
export TRAIN_SCRIPT=./main_training.py

srun -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT

Cluster Details

My cluster is on AWS. As I mentioned, it's an 8-node cluster where each node has 8x H100 GPUs. The cluster is set up with EFA, installed via AWS ParallelCluster.

Dependencies

I'm running with PyTorch nightly plus the dependencies in the repo's requirements.txt file.

raghukiran1224 commented 2 months ago

A good starting point would be to get the profiler traces so we can examine where the bottleneck is. Can you share them?
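
If they're not already enabled, flipping the use_profiler flag that's hardcoded to False in the kwargs above should be enough to produce them; a minimal sketch, reusing that same kwargs dict:

    # Re-run with profiling enabled; this assumes the use_profiler flag toggles
    # the torch.profiler-based trace export in main_training.py.
    kwargs["use_profiler"] = True
    main(**kwargs)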

jasonkrone commented 2 months ago

Certainly, I've just attached the profiler trace for rank 0. To unzip it, run: gunzip compute-gpu-st-distributed-ml-1_132951.1714675937344945469.pt.trace.json.gz.

I've also uploaded the profiler traces for all ranks to a public S3 bucket at s3://profiler-traces/all_ranks. To download them:

  1. Install the AWS CLI:
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    sudo apt install unzip
    unzip awscliv2.zip
    sudo ./aws/install
  2. Download the traces:
    aws s3 sync s3://profiler-traces/all_ranks all_ranks --region us-east-1

compute-gpu-st-distributed-ml-1_132951.1714675937344945469.pt.trace.json.gz
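
If it helps with triage, here's a rough way to see how much of a rank's trace is NCCL kernel time without opening the trace viewer (a sketch; the field names assume the Chrome Trace Event JSON that torch.profiler exports, and nested events make the totals approximate):

    # Rough triage of one rank's trace: total time attributed to NCCL kernels.
    import gzip, json

    path = "compute-gpu-st-distributed-ml-1_132951.1714675937344945469.pt.trace.json.gz"
    with gzip.open(path, "rt") as f:
        events = json.load(f)["traceEvents"]

    # "dur" is in microseconds; metadata events without a duration are skipped.
    nccl_us = sum(e.get("dur", 0) for e in events if "nccl" in e.get("name", "").lower())
    total_us = sum(e.get("dur", 0) for e in events)
    print(f"NCCL kernel time: {nccl_us / 1e6:.1f}s of {total_us / 1e6:.1f}s of recorded events")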

jasonkrone commented 2 months ago

I was able to get this resolved. I'm pretty sure it was an NCCL installation issue. I fixed it by moving from a conda-based install over to Docker. Here's the Dockerfile that worked for me:

# HPC setup taken from
# https://github.com/aws-samples/awsome-distributed-training/blob/0ef9be61cd8cbc0b9744fce51c2a388e1a95877c/3.test_cases/18.deepspeed/0.deepspeed.dockerfile

# pytorch version 2.2.0,
FROM nvcr.io/nvidia/pytorch:24.04-py3
ENV DEBIAN_FRONTEND=noninteractive

# The three must-be-built packages.
# Efa-installer>=1.29.0 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.3-1
ENV NCCL_TESTS_VERSION=master

RUN apt-get update -y
RUN apt-get remove -y --allow-change-held-packages \
                      libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1

# We noticed that since 23.09, we can't just delete the whole /opt/hpcx/, otherwise `import torch`
# complains about missing libuc?.so.
RUN rm -rf /opt/hpcx/ompi \
    && rm -rf /usr/local/mpi \
    && rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
    && ldconfig
ENV OPAL_PREFIX=
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
    git \
    gcc \
    vim \
    kmod \
    openssh-client \
    openssh-server \
    build-essential \
    curl \
    autoconf \
    libtool \
    gdb \
    automake \
    cmake \
    apt-utils \
    libhwloc-dev \
    aptitude && \
    DEBIAN_FRONTEND=noninteractive apt autoremove -y

# EFA
RUN apt-get update && \
    cd /tmp && \
    curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz  && \
    tar -xf aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
    cd aws-efa-installer && \
    # ONLY add `--skip-kmod`, `--no-verify` and `--skip-limit-conf` flags to container image.
    # Those three flags must NOT be used on the host.
    #
    # Explanations:
    # - to build EFA in the Dockerfile, we added --skip-kmod and --no-verify. Without these flags,
    #   the Dockerfile will fail to build. If installing EFA on the host and not in a container,
    #   please remove these flags.
    # - The --skip-limit-conf can be retained in Dockerfile, but it's redundant as the host already
    #   has these limits set by efa_installer.
    ./efa_installer.sh -y -g -d --skip-kmod --no-verify --skip-limit-conf && \
    ldconfig && \
    rm -rf /tmp/aws-efa-installer /var/lib/apt/lists/*
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH

# NCCL EFA Plugin
RUN mkdir -p /tmp && \
    cd /tmp && \
    curl -LO https://github.com/aws/aws-ofi-nccl/archive/refs/tags/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
    tar -xzf /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
    rm /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
    mv aws-ofi-nccl-${AWS_OFI_NCCL_VERSION} aws-ofi-nccl && \
    cd /tmp/aws-ofi-nccl && \
    ./autogen.sh && \
    ./configure --prefix=/opt/amazon/efa \
        --with-libfabric=/opt/amazon/efa \
        --with-cuda=/usr/local/cuda \
        --enable-platform-aws \
        --with-mpi=/opt/amazon/openmpi && \
    make -j$(nproc) install && \
    rm -rf /tmp/aws-ofi-nccl

# Do this to minimize the ld path env vars that users need to define when running this image.
RUN echo "/usr/local/lib"      >> /etc/ld.so.conf.d/local.conf && \
    echo "/opt/amazon/openmpi/lib" >> /etc/ld.so.conf.d/efa.conf && \
    ldconfig

ENV OMPI_MCA_pml=^cm,ucx            \
    OMPI_MCA_btl=tcp,self           \
    OMPI_MCA_btl_tcp_if_exclude=lo,docker0 \
    OPAL_PREFIX=/opt/amazon/openmpi \
    # https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
    # https://github.com/pytorch/pytorch/issues/68893
    NCCL_SOCKET_IFNAME=^docker,lo

ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"

# NCCL-tests: always good to include this as a diagnostic tool.
RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \
    && cd /opt/nccl-tests \
    && git checkout ${NCCL_TESTS_VERSION} \
    && make MPI=1 \
    MPI_HOME=/opt/amazon/openmpi \
    CUDA_HOME=/usr/local/cuda \
    NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80"

RUN pip install fire==0.5.0
RUN pip install pyarrow
RUN pip install transformers
RUN pip install ibm-fms
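
For anyone hitting the same thing: before kicking off a full training run, it's worth sanity-checking the NCCL/EFA path from inside the container. The image above already builds nccl-tests under /opt/nccl-tests for that; a bare torch.distributed all-reduce timing works too. A minimal sketch (launched with the same torchrun/srun setup as the run script above; the script name is illustrative):

    # allreduce_check.py (name is illustrative)
    # Quick sanity check of the NCCL/EFA path, independent of the training code.
    # Launch with the same torchrun/srun setup as the run.sbatch above.
    import os
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.randn(256 * 1024 * 1024 // 4, device="cuda")  # ~256 MiB of fp32

    # Warm up so the first-iteration setup cost doesn't skew the timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        gb = x.numel() * x.element_size() / 1e9
        print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms/iter")

    dist.destroy_process_group()

If the per-iteration time here is far off what nccl-tests reports for the same message size, the problem is in the NCCL/EFA installation rather than in the training code, which is what it turned out to be in my case.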