Closed szhengac closed 1 year ago
This should work; the blank line is probably the culprit. There are also several things you may want to change here, and you might want to install the pytorch enroot hook rather than setting the torch distributed variables manually (see https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration)
```bash
#SBATCH --job-name=megatron
#SBATCH --partition=h100
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=8
#SBATCH --exclusive

export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=WARN

srun -l \
  --container-image /run/enroot/pt.sqsh \
  --container-mount-home \
  --container-workspace /workspace/Megatron-LM \
  --output=$DIR/logs/%x_%j_$DATETIME.log bash -c "${run_cmd}"
```
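For context, a minimal sketch of what "setting the torch distributed variables manually" can look like (the variable choices below are assumptions on my part; the pytorch enroot hook does the equivalent automatically). Each task launched by `srun` derives its rank from the Slurm environment, with fallbacks so the snippet also runs outside Slurm:

```bash
# Sketch: map Slurm's per-task environment to torch.distributed variables.
# SLURM_LAUNCH_NODE_IPADDR, SLURM_NTASKS, SLURM_PROCID, and SLURM_LOCALID
# are set by srun; the ${VAR:-default} fallbacks apply when run standalone.
export MASTER_ADDR=${SLURM_LAUNCH_NODE_IPADDR:-127.0.0.1}
export MASTER_PORT=${MASTER_PORT:-29500}
export WORLD_SIZE=${SLURM_NTASKS:-1}
export RANK=${SLURM_PROCID:-0}
export LOCAL_RANK=${SLURM_LOCALID:-0}
echo "rank=$RANK/$WORLD_SIZE at $MASTER_ADDR:$MASTER_PORT"
```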
@3XX0 You mean the blank line between the `#SBATCH` lines? Removing it does not work; it still launches just one process. And `sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /usr/share/enroot/hooks.d/50-slurm-pytorch.sh /etc/enroot/hooks.d` does not work for me either. I still have to use `export` to set the variables manually.
@3XX0 Thanks for the help. I finally found the mystery. I had accidentally added another line between the first `#SBATCH` and `#!/bin/bash`. Regarding the enroot hook, the enroot path on my system is `/usr/local/etc/enroot` instead of the default `/etc/enroot`.
PS. `--container-workspace` complains there is no such argument.
I see, you can add the extra hook there instead then.

The hook triggers only when it detects `PYTORCH_VERSION`, so that's probably what you're missing in your container image (you can always edit the hook, or add this to `/usr/local/etc/enroot/environ.d` too).
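For reference, a minimal sketch of such an environment file (the filename and version value below are assumptions; the only thing that matters per the comment above is that `PYTORCH_VERSION` is set inside the container):

```
# /usr/local/etc/enroot/environ.d/50-pytorch.env  (hypothetical filename)
# The pytorch hook activates only when PYTORCH_VERSION is present.
PYTORCH_VERSION=2.1.0
```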
Sorry, I meant `--container-workdir`, not workspace.
Hi,

I am testing a Megatron-LM script with pyxis and Slurm on a single box. One thing that bothers me is that the `#SBATCH` arguments do not help launch multiple processes, and I have to pass the same arguments to `srun` again to make it work. The following shows the basic skeleton of my script. You can see that I basically repeat all the `#SBATCH` arguments to `srun`. If I do not, only a single process is launched on GPU 0, and the script hangs at distributed initialization waiting for the other workers to join.