Closed szhengac closed 1 year ago
This should work; the blank line is probably the culprit. There are also several things you may want to change here, and you might want to install the pytorch enroot hook rather than setting the torch distributed variables manually (see https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration)
```bash
#SBATCH --job-name=megatron
#SBATCH --partition=h100
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=8
#SBATCH --exclusive

export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=WARN

srun -l \
  --container-image /run/enroot/pt.sqsh \
  --container-mount-home \
  --container-workspace /workspace/Megatron-LM \
  --output=$DIR/logs/%x_%j_$DATETIME.log bash -c "${run_cmd}"
```
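For context, a minimal sketch of what "setting the torch distributed variables manually" can look like (the variable choices below are assumptions on my part; the pytorch enroot hook does the equivalent automatically). Each task launched by `srun` derives its rank from the Slurm environment, with fallbacks so the snippet also runs outside Slurm:

```bash
# Sketch: map Slurm's per-task environment to torch.distributed variables.
# SLURM_LAUNCH_NODE_IPADDR, SLURM_NTASKS, SLURM_PROCID, and SLURM_LOCALID
# are set by srun; the ${VAR:-default} fallbacks apply when run standalone.
export MASTER_ADDR=${SLURM_LAUNCH_NODE_IPADDR:-127.0.0.1}
export MASTER_PORT=${MASTER_PORT:-29500}
export WORLD_SIZE=${SLURM_NTASKS:-1}
export RANK=${SLURM_PROCID:-0}
export LOCAL_RANK=${SLURM_LOCALID:-0}
echo "rank=$RANK/$WORLD_SIZE at $MASTER_ADDR:$MASTER_PORT"
```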
@3XX0 You mean the blank line between the `#SBATCH` lines? Removing it does not work; it still launches just one process. And `sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /usr/share/enroot/hooks.d/50-slurm-pytorch.sh /etc/enroot/hooks.d` does not work for me either. I still have to use `export` to set the variables manually.
@3XX0 Thanks for the help. I finally found the mystery. I had accidentally added another line between the first `#SBATCH` and `#!/bin/bash`. Regarding the enroot hook, the enroot path on my system is `/usr/local/etc/enroot` instead of the default `/etc/enroot`.
PS. `--container-workspace` complains there is no such argument.
I see, you can add the extra hook there instead then.

The hook triggers only when it detects `PYTORCH_VERSION`, so that's probably what you're missing in your container image (you can always edit the hook, or add this to `/usr/local/etc/enroot/environ.d` too).
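For reference, a minimal sketch of such an environment file (the filename and version value below are assumptions; the only thing that matters per the comment above is that `PYTORCH_VERSION` is set inside the container):

```
# /usr/local/etc/enroot/environ.d/50-pytorch.env  (hypothetical filename)
# The pytorch hook activates only when PYTORCH_VERSION is present.
PYTORCH_VERSION=2.1.0
```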
Sorry, I meant `--container-workdir`, not workspace.
Hi,

I am testing a Megatron-LM script with pyxis and Slurm on a single box. One thing that bothers me is that the `#SBATCH` arguments do not help launch multiple processes, and I have to pass the same arguments to `srun` again to make it work. The following shows the basic skeleton of my script. You can see that I basically repeat all the `#SBATCH` arguments to `srun`. If I do not, only a single process is launched on GPU 0, and the script hangs at distributed initialization waiting for the other workers to join.