facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Multi-node training not working on H100 GPUs #446

Open Tristan-Kosciuch opened 2 months ago

Tristan-Kosciuch commented 2 months ago

Hello,

We're trying to run MusicGen training/fine-tuning from the audiocraft repo using dora. We've been able to run single-node training with dora run -d solver. When we run the same training with torchrun across multiple nodes, it fails with a thread deadlock error, and the same happens with dora launch. We're running on GCP with NVIDIA H100 instances, so I wonder whether the H100s are incompatible with some of audiocraft's dependencies.

When attempting to use dora grid, the process quickly exits with a "FAI" (failed) status and an error saying the config value cannot be overridden:

    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Could not override 'lr'.
To append to your config use +lr=0.01

This is the grid we are using:

from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):

    sub = launcher.bind(lr=0.01)  # bind some parameter value, in a new launcher
    sub.slurm_(gpus=16)  # all jobs scheduled with `sub` will use 16 gpus.

    sub()  # Job with lr=0.01 and 16 gpus.
    sub.bind_(epochs=40)  # in-place version of bind()
    sub.slurm(partition="h100")(batch_size=16)  # lr=0.01, 16 gpus, h100, bs=16 and epochs=40.
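Our current guess is that lr is not a top-level key in audiocraft's config (the launch script below overrides optim.lr and dataset.batch_size), so the grid probably has to bind the same fully-qualified keys. A rough sketch of what we mean, assuming dora's bind() accepts dicts and Hydra-style override strings as its README suggests, and that the nesting matches the launch script:

from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):
    # Use the fully-qualified config paths that Hydra already knows about,
    # mirroring the overrides we pass on the dora launch command line.
    sub = launcher.bind({"optim.lr": 0.01})   # dict form with a dotted key
    sub.bind_(["dataset.batch_size=16"])      # or Hydra-style override strings
    sub.slurm_(gpus=16)                       # all jobs scheduled with `sub` use 16 gpus
    sub()                                     # schedule one job with these overrides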

A few warnings from dora launch stand out. We have a script that starts dora launch, which we run with sh launch.sh (we don't use sbatch/srun here; should we?). Here's the script:

#!/bin/sh

# these logging exports don't do much
export HYDRA_FULL_ERROR=1
export CUDBG_USE_LEGACY_DEBUGGER=1
export NVLOG_CONFIG_FILE=${HOME}/nvlog.stdout.config
export NVLOG_CONFIG_FILE=${HOME}/nvlog.config
export WANDB_MODE=offline

cd /home/$USER/audiocraft/

export AUDIOCRAFT_DORA_DIR=/projects/$USER/

export TEAM=$TEAM
export USER=$USER
export NCCL_DEBUG=INFO  # valid levels are VERSION, WARN, INFO, TRACE; "DEBUG" is not one of them
export LOGLEVEL=INFO

dora launch -a --no_git_save -p h100 -g 16 solver=musicgen/musicgen_32khz \
model/lm/model_scale=small \
conditioner=text2music \
dset=audio/data_32khz \
dataset.num_workers=0 \
continue_from=//pretrained/facebook/musicgen-small \
dataset.valid.num_samples=16 \
dataset.batch_size=64 \
schedule.cosine.warmup=500 \
optim.optimizer=dadam \
optim.lr=1e-4 \
optim.epochs=30 \
slurm.setup=[". /home/$USER/anaconda3/etc/profile.d/conda.sh","conda activate torch_ac"] \
optim.updates_per_epoch=1000

If I run this from our Slurm login node, I get the message below, which makes me question whether training is being attempted on the login node itself:

/home/$USER/anaconda3/envs/torch_ac/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

If we run the script from one of the GPU nodes, there is no NVML issue, but training hangs with GPU usage stuck at 100% and eventually times out, crashing with a message about the backward pass for gradients. I'll post that log here soon.
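For what it's worth, a bare NCCL all_reduce check, independent of audiocraft, should at least tell us whether multi-node collectives work on these H100 nodes at all. A minimal sketch (nccl_check.py is just a placeholder name, and the torchrun rendezvous settings are assumptions about our setup):

# nccl_check.py -- throwaway multi-node NCCL sanity check.
# Launch on each node, for example:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Every rank contributes 1.0; after all_reduce each rank should see the world size.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()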

Any info is helpful, even just an example of how to run dora launch on a Slurm cluster (is it supposed to be submitted with sbatch?).

nischalj10 commented 2 months ago

Following. Were you able to resolve this?

Tristan-Kosciuch commented 2 months ago

Unfortunately we were not able to; we're still working on it. Have you managed it?

nischalj10 commented 2 months ago

Nope