facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Error launching multi-node jobs using MPI #453

Closed robin-karlsson0 closed 2 years ago

robin-karlsson0 commented 2 years ago

Hi. Does anyone know the correct way to launch VISSL jobs using MPI on a supercomputer with multiple nodes/machines without SLURM?

Use case: 2 nodes w. 4 GPUs each

As far as I understand, the issue seems to be that instead of a single master node with rank 0, eight separate rank-0 processes are launched, each of which spawns 4 workers. Is my launch script incorrect? Is there some conflict between the MPI and NCCL backends?

Hope someone knows an easy fix or correction to my configuration! Thank you :slightly_smiling_face:

...

Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 61, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 44, in hydra_main
    launch_distributed(
  File "/home/z44406a/projects/vissl/vissl/utils/distributed_launcher.py", line 135, in launch_distributed
    torch.multiprocessing.spawn(
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/z44406a/projects/vissl/vissl/utils/distributed_launcher.py", line 192, in _distributed_worker
    run_engine(
  File "/home/z44406a/projects/vissl/vissl/engines/engine_registry.py", line 86, in run_engine
    engine.run_engine(
  File "/home/z44406a/projects/vissl/vissl/engines/train.py", line 39, in run_engine
    train_main(
  File "/home/z44406a/projects/vissl/vissl/engines/train.py", line 127, in train_main
    trainer = SelfSupervisionTrainer(
  File "/home/z44406a/projects/vissl/vissl/trainer/trainer_main.py", line 85, in __init__
    self.setup_distributed(self.cfg.MACHINE.DEVICE == "gpu")
  File "/home/z44406a/projects/vissl/vissl/trainer/trainer_main.py", line 117, in setup_distributed
    torch.distributed.init_process_group(
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 534, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 184, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/home/z44406a/.pyenv/versions/3.8.11/envs/vissl/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use
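
My guess is that the "Address already in use" comes from several of those rank-0 processes all trying to listen on the same rendezvous port. A rough stand-in with plain sockets (not the real TCPStore) shows the same kind of failure:

    import socket

    # Two listeners on the same rendezvous port (standing in for two "rank 0"
    # processes): the second bind fails, just like in the traceback above.
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.bind(("127.0.0.1", 29500))
    a.listen()

    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    b.bind(("127.0.0.1", 29500))  # OSError: [Errno 98] Address already in use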


- My job launch script on the SC

#!/bin/bash -x
#PJM -L rscgrp=
#PJM -L node=2
#PJM -L elapse=5:00
#PJM -j
#PJM -S

module load gcc/8.4.0
module load python/3.8.11
module load cuda/11.2.1
module load cudnn/8.1.1
module load nccl/2.8.4
module load openmpi_cuda/4.0.5

source /home/###/.pyenv/versions/vissl/bin/activate

# distributed setting
MASTER_ADDR=$(head -n 1 ${PJM_O_NODEINF})

mpirun -np 8 -machinefile ${PJM_O_NODEINF} -map-by ppr:4:node \
    python tools/run_distributed_engines.py \
    config=pretrain/swav/swav_8node_resnet.yaml \
    config.DISTRIBUTED.RUN_ID="${MASTER_ADDR}":29500


- Resulting log file

[log01.txt](https://github.com/facebookresearch/vissl/files/7358824/log01.txt)

- VISSL config file: the DISTRIBUTED settings were specified according to the instructions

DISTRIBUTED:
  BACKEND: nccl
  NUM_NODES: 2              # user sets this to number of machines to use
  NUM_PROC_PER_NODE: 4      # user sets this to number of gpus to use per machine
  INIT_METHOD: tcp          # recommended if feasible otherwise
  RUN_ID: localhost:{port}  # select the free port


Ref: https://vissl.readthedocs.io/en/v0.1.5/large_scale/distributed_training.html#train-on-multiple-machines

I'm unsure about the meaning of RUN_ID in terms of rank and process. https://github.com/facebookresearch/vissl/blob/ee91c2c3b6cb9a27bb396161116bb270ec9c86c5/vissl/utils/misc.py#L128
hints that it should be the master node's address, though.

Also, how are node rank and communication actually set up within VISSL? How are `node_id` and `local_rank` determined?
Ref: https://github.com/facebookresearch/vissl/blob/ee91c2c3b6cb9a27bb396161116bb270ec9c86c5/vissl/utils/env.py#L12
iseessel commented 2 years ago

Hi @robin-karlsson0, thanks for reaching out and glad to hear you're using VISSL for large-scale distributed training!

node_id should be set in the configuration here https://github.com/facebookresearch/vissl/blob/main/vissl/config/defaults.yaml#L49. It then gets passed in here: https://github.com/facebookresearch/vissl/blob/main/vissl/utils/distributed_launcher.py#L134 and finally the distributed rank gets set here: https://github.com/facebookresearch/vissl/blob/main/vissl/utils/distributed_launcher.py#L182.
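
In other words, each node runs NUM_PROC_PER_NODE workers, and the global rank is derived from node_id plus the worker's local rank, while RUN_ID ("master_host:port") is essentially the rendezvous address that all ranks connect to. A simplified sketch of that bookkeeping (illustrative only, not the exact VISSL code):

    # Simplified sketch of how the distributed rank is derived from node_id.
    def distributed_rank(node_id: int, local_rank: int, num_proc_per_node: int) -> int:
        # Each node runs num_proc_per_node workers; the global rank is the
        # node's offset plus the worker's local rank on that node.
        return node_id * num_proc_per_node + local_rank

    # With NUM_NODES=2 and NUM_PROC_PER_NODE=4: node 0 owns ranks 0-3, node 1 owns ranks 4-7.
    assert distributed_rank(node_id=1, local_rank=2, num_proc_per_node=4) == 6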

I am not familiar with your setup, but the easiest fix would be if somehow you could retrieve the NODE_ID through an environment variable in your script. You can see for example how we get the env variable for SLURM: https://github.com/facebookresearch/vissl/blob/main/vissl/utils/slurm.py#L11, there should be something similar for your system.

    config.node_id=$NODE_ID_ENV_VARIABLE
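
For example, SLURM exposes the node index as SLURM_NODEID, and if your MPI launcher starts exactly one process per node, Open MPI's OMPI_COMM_WORLD_RANK would play the same role. A rough, hypothetical sketch of such a lookup (not part of VISSL):

    import os

    # Hypothetical helper: read the node id from whatever the scheduler/launcher exposes.
    # SLURM_NODEID comes from SLURM; OMPI_COMM_WORLD_RANK equals the node id only
    # when mpirun launches a single process per node.
    def get_node_id(default: int = 0) -> int:
        for var in ("SLURM_NODEID", "OMPI_COMM_WORLD_RANK"):
            if var in os.environ:
                return int(os.environ[var])
        return default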

You also have to specify the following for your training, since you want to use MPI with 2 nodes and 4 GPUs per node. I noticed you load nccl -- if your setup supports NCCL training, I would recommend using NCCL instead: it should be faster and is the backend recommended by PyTorch. VISSL is also not well tested in an MPI environment.

    config.DISTRIBUTED.BACKEND=mpi
    config.DISTRIBUTED.NUM_NODES=2
    config.DISTRIBUTED.NUM_PROC_PER_NODE=4

And finally, I would recommend altering the sync batchnorm group size from 8 to 4. This will sync batchnorm mean and variance statistics within one node, which should improve performance since it doesn't require cross-node communication.

    config.MODEL.SYNC_BN_CONFIG.GROUP_SIZE=4

So your final script would be something like:

#!/bin/bash -x
#PJM -L rscgrp=####
#PJM -L node=2
#PJM -L elapse=5:00
#PJM -j
#PJM -S

module load gcc/8.4.0
module load python/3.8.11
module load cuda/11.2.1
module load cudnn/8.1.1
module load nccl/2.8.4
module load openmpi_cuda/4.0.5

source /home/###/.pyenv/versions/vissl/bin/activate

# distributed setting
MASTER_ADDR=$(head -n 1 ${PJM_O_NODEINF})

mpirun -np 8 -machinefile ${PJM_O_NODEINF} -map-by ppr:4:node \
    python tools/run_distributed_engines.py \
    config=pretrain/swav/swav_8node_resnet.yaml \
    config.DISTRIBUTED.RUN_ID="${MASTER_ADDR}":29500 \
    config.DISTRIBUTED.NUM_NODES=2 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=4 \
    config.MODEL.SYNC_BN_CONFIG.GROUP_SIZE=4 \
    config.DISTRIBUTED.BACKEND=mpi \
    config.node_id=$NODE_ID_ENV_VARIABLE

LMK if that makes sense and/or if you have any questions!

robin-karlsson0 commented 2 years ago

Hello @iseessel and thank you very much for your quick and helpful comments!

Happy to confirm that I managed to run VISSL on multiple nodes using the following approach :slightly_smiling_face: :tada:

A launch script, (1) job.sh, uses MPI to start one process per node, and each of those processes runs another script, (2) distributed_train.sh, which relies on the NCCL backend in VISSL.

One key point was your feedback to substitute the config's node_id value with the actually assigned node id.

Would you like me to extend the existing multi-node documentation page with more details as well as the scripts as a working example?

Ref: https://vissl.readthedocs.io/en/v0.1.5/large_scale/distributed_training.html#train-on-multiple-machines

(1) job.sh

#!/bin/bash -x
#PJM -L rscgrp=###
#PJM -L node=2
#PJM -L elapse=###
#PJM -j
#PJM -S
#PJM -x PJM_BEEOND=1

module load gcc/8.4.0
module load python/3.8.3
module load cuda/11.2.1
module load cudnn/8.1.1
module load nccl/2.8.4
module load openmpi_cuda/4.0.5

source /home/###/.pyenv/versions/vissl/bin/activate

# distributed setting
# -np == number of nodes: one launcher process per node (-npernode 1); VISSL spawns the GPU workers itself
mpirun \
    -np 2 \
    -npernode 1 \
    -bind-to none \
    -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x NCCL_SOCKET_IFNAME="ib0" \
    -mca pml ob1 \
    -mca btl ^openib \
    -mca btl_tcp_if_include ib0 \
    -mca plm_rsh_agent /bin/pjrsh \
    -machinefile ${PJM_O_NODEINF} \
    /home/###/vissl/distributed_train.sh

(2) distributed_train.sh

#!/bin/bash

set -euo pipefail

# load modules
source /etc/profile.d/modules.sh
module load gcc/8.4.0
module load python/3.8.3
module load cuda/11.2.1
module load cudnn/8.1.1
module load nccl/2.8.4
module load openmpi_cuda/4.0.5

source /home/###/.pyenv/versions/vissl/bin/activate

# https://github.com/pytorch/pytorch/issues/37377
export MKL_THREADING_LAYER=GNU
export OMP_NUM_THREADS=1

# distributed setting
MY_ADDR=$(hostname -I | awk '{print $1}')
MASTER_ADDR=$(head -n 1 ${PJM_O_NODEINF})
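# NODE_RANK = zero-based line number of this host in the machinefile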
NODE_RANK=$(cat ${PJM_O_NODEINF} | awk '{print NR-1 " " $1}' | grep ${MY_ADDR}$ | awk '{print $1}')
echo "MY_ADDR=${MY_ADDR}"
echo "MASTER_ADDR=${MASTER_ADDR}"
echo "NODE_RANK=${NODE_RANK}"

python tools/run_distributed_engines.py \
    config=pretrain/###.yaml \
    config.DISTRIBUTED.RUN_ID="${MASTER_ADDR}":29500 \
    node_id="${NODE_RANK}"
iseessel commented 2 years ago

@robin-karlsson0 Good idea -- glad it worked!

iseessel commented 2 years ago

@robin-karlsson0

Wondering if you had any learnings about using VISSL with MPI? I have never used VISSL in an MPI environment.

robin-karlsson0 commented 2 years ago

@iseessel Hi there. Thanks for the reminder. Got busy with new projects after our last communication, but will find time to document my experience by this weekend :slightly_smiling_face: