Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Code hangs at `fabric.launch()` when nodes are ordered using SLURM #17228

Open shethdhvani opened 1 year ago

shethdhvani commented 1 year ago

Bug description

I am using a SLURM cluster with multiple nodes; each node has 8 GPUs. I order the nodes using SLURM_HOSTFILE and --distribution=arbitrary (see the sbatch file below), but the code hangs when I call fabric.launch().

Below is the Python file, test.py:

import socket
from typing import Sequence

from absl import app, logging
from lightning import fabric as lightning_fabric

def main(argv: Sequence[str]):
    del argv  # Unused.

    fabric = lightning_fabric.Fabric(precision="bf16",
                                     accelerator="gpu",
                                     devices=8,
                                     num_nodes=4)
    fabric.launch()
    logging.info("%s: %s", socket.gethostname(), fabric.global_rank)

if __name__ == '__main__':
    app.run(main)

Below is the sbatch file, test.sbatch:

#!/bin/bash
#SBATCH --job-name=distributed_run
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
export PMI_DEBUG=1

source <path to mpivars.sh>

echo $SLURM_NODELIST

export SLURM_HOSTFILE=$(mktemp)
python3 /home/ubuntu/node_ordering.py --host-order-file <file with nodes ordered> | tee $SLURM_HOSTFILE

cat $SLURM_HOSTFILE

srun -v --mpi=pmi2 --gpus-per-node=$SBATCH_GPUS_PER_NODE \
     --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
     --distribution=arbitrary \
     python3 /home/ubuntu/test.py

node_ordering.py sanitizes the ordered hosts file and writes it (via tee) to the file referenced by the SLURM_HOSTFILE variable. The ordered hosts file has one host per line, with as many lines as tasks: each host is repeated 8 times before moving on to the next. For this test that is 32 lines in total, 8 lines per node (an example of the resulting hostfile is shown after the script below).

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--host-order-file",
        help="Path to file with host ordering, one host per line.",
    )
    args = parser.parse_args()

    with open(args.host_order_file) as h:
        global_host_order = [host.strip() for host in h]

    print("\n".join(global_host_order))
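
For illustration, with the node order used in this run (compute-permanent-node-801 first, then 869, 583, 752), the generated $SLURM_HOSTFILE would look like this (truncated):

compute-permanent-node-801
compute-permanent-node-801
...   (8 lines total for compute-permanent-node-801)
compute-permanent-node-869
...   (8 lines total for compute-permanent-node-869)
compute-permanent-node-583
...   (8 lines total for compute-permanent-node-583)
compute-permanent-node-752
...   (8 lines total for compute-permanent-node-752)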

Command to run: sbatch test.sbatch

Output

srun: defined options
srun: -------------------- --------------------
srun: (null)              : compute-permanent-node-[583,752,801,869]
srun: distribution        : arbitrary
srun: gpus-per-node       :
srun: jobid               : 232
srun: job-name            : distributed_run
srun: mpi                 : pmi2
srun: nodes               : 4
srun: ntasks              : 32
srun: ntasks-per-node     : 8
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 232: nodes(4):`compute-permanent-node-[583,752,801,869]', cpu counts: 64(x4)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=232.0 on host compute-permanent-node-583, 8 tasks: [16-23]
srun: launching StepId=232.0 on host compute-permanent-node-752, 8 tasks: [24-31]
srun: launching StepId=232.0 on host compute-permanent-node-801, 8 tasks: [0-7]
srun: launching StepId=232.0 on host compute-permanent-node-869, 8 tasks: [8-15]
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node compute-permanent-node-752, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-583, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-869, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-801, 8 tasks started
Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/32
Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/32
Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/32
Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/32
Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/32
Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/32
Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/32
Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/32
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/32
Initializing distributed: GLOBAL_RANK: 26, MEMBER: 27/32
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/32
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/32
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/32
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/32
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/32
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/32
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/32
Initializing distributed: GLOBAL_RANK: 30, MEMBER: 31/32
Initializing distributed: GLOBAL_RANK: 24, MEMBER: 25/32
Initializing distributed: GLOBAL_RANK: 25, MEMBER: 26/32
Initializing distributed: GLOBAL_RANK: 29, MEMBER: 30/32
Initializing distributed: GLOBAL_RANK: 28, MEMBER: 29/32
Initializing distributed: GLOBAL_RANK: 27, MEMBER: 28/32
Initializing distributed: GLOBAL_RANK: 31, MEMBER: 32/32
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/32
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/32
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/32
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/32
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/32
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/32
/home/ubuntu/.local/lib/python3.8/site-packages/lightning/fabric/connector.py:562: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
  rank_zero_warn(
Using bfloat16 Automatic Mixed Precision (AMP)
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/32
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/32

It hangs after the initialization shown above.

How to reproduce the bug

All the files are above.

Error messages and logs

See above

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Fabric
#- PyTorch Lightning Version (e.g., 1.5.0): 2.0.0
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
  - pytorch-lightning: 2.0.0
  - torch: 2.0.0
#- Python version (e.g., 3.9): 3.8.10
#- OS (e.g., Linux): Ubuntu 20.04.1
#- CUDA/cuDNN version: 11.7
  - nvidia-cublas-cu11: 11.10.3.66
  - nvidia-cuda-cupti-cu11: 11.7.101
  - nvidia-cuda-nvrtc-cu11: 11.7.99
  - nvidia-cuda-runtime-cu11: 11.7.99
  - nvidia-cudnn-cu11: 8.5.0.96
  - nvidia-cufft-cu11: 10.9.0.58
  - nvidia-curand-cu11: 10.2.10.91
  - nvidia-cusolver-cu11: 11.4.0.1
  - nvidia-cusparse-cu11: 11.7.4.91
  - nvidia-nccl-cu11: 2.14.3
  - nvidia-nvtx-cu11: 11.7.91
#- GPU models and configuration: NVIDIA A100 40 GB
#- How you installed Lightning (`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): cloud
```

More info

No response

cc @awaelchli @carmocca @justusschock

shethdhvani commented 1 year ago

I updated the logging in test.py as below, and now I can see that even though the processes are in the node order that I want, the NODE_RANK isn't.

import socket
from typing import Sequence

from absl import app, logging
from lightning import fabric as lightning_fabric

def main(argv: Sequence[str]):
    del argv  # Unused.

    fabric = lightning_fabric.Fabric(precision="bf16",
                                     accelerator="gpu",
                                     devices=8,
                                     num_nodes=4)
    logging.info("Hostname: %s, Global Rank: %s, World Size: %s, Local Rank: %s, Node Rank: %s", socket.gethostname(), fabric.global_rank, fabric.world_size, fabric.local_rank, fabric.node_rank)
    fabric.launch()
    logging.info("Hostname: %s, Global Rank: %s, World Size: %s, Local Rank: %s, Node Rank: %s", socket.gethostname(), fabric.global_rank, fabric.world_size, fabric.local_rank, fabric.node_rank)

if __name__ == '__main__':
    app.run(main)

Output

srun: defined options
srun: -------------------- --------------------
srun: (null)              : compute-permanent-node-[583,752,801,869]
srun: distribution        : arbitrary
srun: gpus-per-node       :
srun: jobid               : 247
srun: job-name            : distributed_run
srun: mpi                 : pmi2
srun: nodes               : 4
srun: ntasks              : 32
srun: ntasks-per-node     : 8
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 247: nodes(4):`compute-permanent-node-[583,752,801,869]', cpu counts: 64(x4)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=247.0 on host compute-permanent-node-583, 8 tasks: [16-23]
srun: launching StepId=247.0 on host compute-permanent-node-752, 8 tasks: [24-31]
srun: launching StepId=247.0 on host compute-permanent-node-801, 8 tasks: [0-7]
srun: launching StepId=247.0 on host compute-permanent-node-869, 8 tasks: [8-15]
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node compute-permanent-node-869, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-801, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-583, 8 tasks started
srun: launch/slurm: _task_start: Node compute-permanent-node-752, 8 tasks started
I0330 18:16:54.303619 139697084143424 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 11, World Size: 32, Local Rank: 3, Node Rank: 3
I0330 18:16:54.305584 140643384964928 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 2, World Size: 32, Local Rank: 2, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/32
I0330 18:16:54.305157 139915590850368 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 9, World Size: 32, Local Rank: 1, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/32
I0330 18:16:54.311784 140398114289472 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 10, World Size: 32, Local Rank: 2, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/32
I0330 18:16:54.312869 139936163600192 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 14, World Size: 32, Local Rank: 6, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/32
I0330 18:16:54.313498 139840547825472 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 12, World Size: 32, Local Rank: 4, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/32
I0330 18:16:54.314639 139895322158912 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 15, World Size: 32, Local Rank: 7, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/32
I0330 18:16:54.315909 140191386740544 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 8, World Size: 32, Local Rank: 0, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/32
I0330 18:16:54.317279 140034056619840 test_dhvani.py:14] Hostname: compute-permanent-node-869, Global Rank: 13, World Size: 32, Local Rank: 5, Node Rank: 3
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/32
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/32
I0330 18:16:54.314607 140708625536832 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 1, World Size: 32, Local Rank: 1, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/32
I0330 18:16:54.316093 140261499483968 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 4, World Size: 32, Local Rank: 4, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/32
/home/ubuntu/.local/lib/python3.8/site-packages/lightning/fabric/connector.py:562: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
  rank_zero_warn(
Using bfloat16 Automatic Mixed Precision (AMP)
I0330 18:16:54.317313 140502209234752 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 0, World Size: 32, Local Rank: 0, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/32
I0330 18:16:54.318425 140415136458560 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 5, World Size: 32, Local Rank: 5, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/32
I0330 18:16:54.319544 140615691507520 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 3, World Size: 32, Local Rank: 3, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/32
I0330 18:16:54.320640 139715299657536 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 6, World Size: 32, Local Rank: 6, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/32
I0330 18:16:54.321194 140346986075968 test_dhvani.py:14] Hostname: compute-permanent-node-801, Global Rank: 7, World Size: 32, Local Rank: 7, Node Rank: 2
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/32
I0330 18:16:54.363456 140436730341184 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 25, World Size: 32, Local Rank: 1, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 25, MEMBER: 26/32
I0330 18:16:54.366786 139718088111936 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 30, World Size: 32, Local Rank: 6, Node Rank: 1
I0330 18:16:54.366882 140216424646464 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 27, World Size: 32, Local Rank: 3, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 30, MEMBER: 31/32
Initializing distributed: GLOBAL_RANK: 27, MEMBER: 28/32
I0330 18:16:54.367706 140578643445568 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 29, World Size: 32, Local Rank: 5, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 29, MEMBER: 30/32
I0330 18:16:54.369673 140712776693568 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 24, World Size: 32, Local Rank: 0, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 24, MEMBER: 25/32
I0330 18:16:54.370116 140692250466112 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 31, World Size: 32, Local Rank: 7, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 31, MEMBER: 32/32
I0330 18:16:54.370689 139901112284992 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 28, World Size: 32, Local Rank: 4, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 28, MEMBER: 29/32
I0330 18:16:54.371742 139848063510336 test_dhvani.py:14] Hostname: compute-permanent-node-752, Global Rank: 26, World Size: 32, Local Rank: 2, Node Rank: 1
Initializing distributed: GLOBAL_RANK: 26, MEMBER: 27/32
I0330 18:16:54.549028 140545898977088 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 21, World Size: 32, Local Rank: 5, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/32
I0330 18:16:54.559207 139919473252160 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 20, World Size: 32, Local Rank: 4, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/32
I0330 18:16:54.559575 140489117607744 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 18, World Size: 32, Local Rank: 2, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/32
I0330 18:16:54.560622 140384903231296 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 16, World Size: 32, Local Rank: 0, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/32
I0330 18:16:54.562446 140220584195904 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 23, World Size: 32, Local Rank: 7, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/32
I0330 18:16:54.564320 139797709399872 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 19, World Size: 32, Local Rank: 3, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/32
I0330 18:16:54.568092 140099039987520 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 17, World Size: 32, Local Rank: 1, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/32
I0330 18:16:54.571164 140285401782080 test_dhvani.py:14] Hostname: compute-permanent-node-583, Global Rank: 22, World Size: 32, Local Rank: 6, Node Rank: 0
Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/32
ipoletaev commented 1 year ago

Would appreciate any help from the lightning-ai team here!

shethdhvani commented 1 year ago

I was able to get this working by setting the SLURM_NODELIST environment variable to the ordered node list inside the test Python file. The NODE_RANK is still assigned in ascending order, but the code works and does not hang. Can this be made default behavior? test.py

import socket
from typing import Sequence

from absl import app, logging
from lightning import fabric as lightning_fabric
import os

def main(argv: Sequence[str]):
    if len(argv) > 1:
        with open(argv[1]) as h:
            slurm_list = [host.strip() for host in h]
        slurm_nodelist = (",".join(slurm_list))
        os.environ["SLURM_NODELIST"] = slurm_nodelist
    fabric = lightning_fabric.Fabric(precision="bf16",
                                     accelerator="gpu",
                                     devices=8,
                                     num_nodes=4)
    fabric.launch()
    logging.info("Hostname: %s, Global Rank: %s, World Size: %s, Local Rank: %s, Node Rank: %s", socket.gethostname(), fabric.global_rank, fabric.world_size, fabric.local_rank, fabric.node_rank)

if __name__ == '__main__':
    app.run(main)

test.sbatch

#!/bin/bash
#SBATCH --job-name=distributed_run
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
export PMI_DEBUG=1

source <path to mpivars.sh>

echo $SLURM_NODELIST

export SLURM_HOSTFILE=$(mktemp)
python3 /home/ubuntu/node_ordering.py --host-order-file <file with nodes ordered> | tee $SLURM_HOSTFILE

cat $SLURM_HOSTFILE

srun -v --mpi=pmi2 --gpus-per-node=$SBATCH_GPUS_PER_NODE \
     --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
     --distribution=arbitrary \
     python3 /home/ubuntu/test.py <ordered nodelist>

Also, setting SLURM_NODELIST to the ordered hosts in the sbatch file does not propagate to the Python environment.

awaelchli commented 1 year ago

Hello, sorry for the late reply; I am a bit overloaded. I don't fully understand what the problem is, but I gather that you fixed the issue by reordering some node names.

I can point you to the code where the nodelist gets parsed: https://github.com/Lightning-AI/lightning/blob/6cbc9dfb9174ec38909a3d0e46f584ce109777cf/src/lightning/fabric/plugins/environments/slurm.py#L55-L60

We only use this to determine which node is the main node, so maybe the wrong one was returned there? When you ask

Can this be made default behavior?

I'm not sure what we could change apart from the piece of code that I linked above. Our SLURM code in Lightning only reads and parses the environment variables as they are set in the environment.
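
For illustration, here is a rough, hypothetical sketch of the kind of parsing that turns a compressed SLURM nodelist into a single main address (the actual logic lives in the slurm.py lines linked above and may differ):

import re

def first_host_from_nodelist(nodelist: str) -> str:
    # Hypothetical sketch: expand a compressed list like "prefix-[a,b,...]" and
    # return the first entry; plain comma-separated lists just take the first item.
    match = re.match(r"(.*?)\[(.*?)\]", nodelist)
    if match:
        prefix, ids = match.groups()
        return prefix + ids.split(",")[0].split("-")[0]
    return nodelist.split(",")[0]

print(first_host_from_nodelist("compute-permanent-node-[583,752,801,869]"))
# prints: compute-permanent-node-583

With SLURM_NODELIST set to the compressed form compute-permanent-node-[583,752,801,869], this style of parsing yields compute-permanent-node-583, i.e. the alphabetically first node rather than the first node of the hostfile ordering.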

@Queuecumber Have you ever seen an issue like this?

Queuecumber commented 1 year ago

No, but I've never used SLURM_HOSTFILE either. I assume there's a good reason for needing a specific ordering of the hosts?

shethdhvani commented 1 year ago

Hi,

I had also looked at the code you mentioned. From it, I was able to determine that Lightning is not getting the root address (main node) I want, which I specify via the SLURM_HOSTFILE environment variable. It uses the SLURM_NODELIST environment variable instead, so the node we want as root per our node order is not returned.

So I set the SLURM_NODELIST variable to the node order I want in the sbatch file. However, Lightning did not pick up the node order I export in the sbatch file; instead, the nodes are arranged in ascending order. Thus the master node and root address are different from what they should be, and the code hangs.

To work around this, I currently update the SLURM_NODELIST environment variable in the Python file before I launch Fabric.

Can this be made default behavior?

I'm not sure what we could change apart from the piece of code that I linked above. Our SLURM code in Lightning only reads and parses the environment variables as they are set in the environment.

When I set and export the SLURM_NODELIST environment variable in the sbatch file, it is not read by Lightning in the same node order. When I read the SLURM_NODELIST environment variable in the Python file, it is sorted in alphabetical/ascending order. Somewhere in between, after launching the Python application with srun and before launching Fabric, the SLURM_NODELIST environment variable gets sorted. Can you check where this happens? To summarize, the default behavior should be that when Lightning reads the SLURM_NODELIST environment variable, it is in the same node order as exported.

Why is this ordering important? We want to group nodes on the same switch that are close to each other, in order to minimize the back-and-forth of communication and traffic between GPUs. Thus, we need a specific ordering and not what SLURM provides by default. Based on the SLURM documentation (https://slurm.schedmd.com/srun.html#OPT_arbitrary), when we set --distribution=arbitrary and export SLURM_HOSTFILE, the processes are allocated in order as listed in the file exported via SLURM_HOSTFILE. Based on the output below,

srun: launching StepId=232.0 on host compute-permanent-node-583, 8 tasks: [16-23]
srun: launching StepId=232.0 on host compute-permanent-node-752, 8 tasks: [24-31]
srun: launching StepId=232.0 on host compute-permanent-node-801, 8 tasks: [0-7]
srun: launching StepId=232.0 on host compute-permanent-node-869, 8 tasks: [8-15]

we can see that the processes were started in the order we want (compute-permanent-node-801, compute-permanent-node-869, compute-permanent-node-583, compute-permanent-node-752). However, the master node and root address picked up by Lightning (compute-permanent-node-583, the first node in alphabetical/ascending order) do not match the node where rank 0 actually runs, so the processes wait for connections and the code hangs.
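
For context, a minimal sketch (not the scripts above) of the env:// rendezvous that fabric.launch() roughly performs under the hood; the hostname and port below are placeholders taken from this issue:

import os
import torch.distributed as dist

# Every rank connects to MASTER_ADDR:MASTER_PORT to rendezvous. The process with
# global rank 0 must actually be running on MASTER_ADDR; if Lightning derives
# compute-permanent-node-583 from the sorted nodelist while rank 0 really runs on
# compute-permanent-node-801, the other ranks wait on a server that never comes up.
os.environ.setdefault("MASTER_ADDR", "compute-permanent-node-801")  # host of global rank 0
os.environ.setdefault("MASTER_PORT", "29500")                       # placeholder port

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=int(os.environ["SLURM_PROCID"]),
    world_size=int(os.environ["SLURM_NTASKS"]),
)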

carmocca commented 1 year ago

@shethdhvani

When I set and export SLURM_NODELIST environment variable in the sbatch file, it is not read by Lightning in the same node order. When I read SLURM_NODELIST environment variable in the python file, it is sorted in alphabetical/ascending order. Somewhere in between, after launching the python application using srun and before launching fabric, this SLURM_NODELIST environment variable is sorted. Can you check where is this happening? To summarize, the default behavior should be that when Lightning reads the SLURM_NODELIST environment variable, it should be in the same node order as exported.

What does the following print?

import os

print(1, os.environ["SLURM_NODELIST"])
import lightning as L
print(2, os.environ["SLURM_NODELIST"])
fabric = L.Fabric()
print(3, os.environ["SLURM_NODELIST"])
fabric.launch()
print(4, os.environ["SLURM_NODELIST"])

Because if the environment variable is not what you expected at (1), then this is not an issue with Fabric.

shethdhvani commented 1 year ago

Output

SLURM_NODELIST in sbatch
compute-permanent-node-801,compute-permanent-node-869,compute-permanent-node-583,compute-permanent-node-752
I0411 01:28:05.820399 139934585009984 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.830992 140313237600064 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836308 140190379489088 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.842156 139834860619584 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.820408 139805815334720 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.820411 140254645638976 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.820414 140681562302272 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.820474 140266930190144 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.820590 140401730049856 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.830717 139651106867008 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.830717 140062394136384 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.831131 139849786435392 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.831359 139667380000576 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.831362 140463569766208 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.831353 140564078982976 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.831432 139740111906624 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.834348 139866054903616 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.834354 140523141875520 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836309 140410258900800 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836318 139695552149312 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836344 140025252558656 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836350 139996747781952 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.836328 140198719661888 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.838040 139933862451008 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.839607 140016763078464 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.842153 140065187370816 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.842159 140656298329920 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.842340 139645271131968 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.845100 140492815607616 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.845136 140087875577664 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.845237 140600840021824 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:05.845268 140215295838016 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.728551 140198719661888 instantiator.py:21] Created a temporary directory at /tmp/tmprplgg96u
I0411 01:28:09.728687 139933862451008 instantiator.py:21] Created a temporary directory at /tmp/tmpc165vtz5
I0411 01:28:09.728770 140198719661888 instantiator.py:76] Writing /tmp/tmprplgg96u/_remote_module_non_scriptable.py
I0411 01:28:09.728728 140410258900800 instantiator.py:21] Created a temporary directory at /tmp/tmp9utc03_h
I0411 01:28:09.728772 140190379489088 instantiator.py:21] Created a temporary directory at /tmp/tmpbv5f_vd3
I0411 01:28:09.728827 139996747781952 instantiator.py:21] Created a temporary directory at /tmp/tmpbh45v8iz
I0411 01:28:09.728912 139933862451008 instantiator.py:76] Writing /tmp/tmpc165vtz5/_remote_module_non_scriptable.py
I0411 01:28:09.728956 140410258900800 instantiator.py:76] Writing /tmp/tmp9utc03_h/_remote_module_non_scriptable.py
I0411 01:28:09.728926 139695552149312 instantiator.py:21] Created a temporary directory at /tmp/tmpxnl6vm_i
I0411 01:28:09.729001 140190379489088 instantiator.py:76] Writing /tmp/tmpbv5f_vd3/_remote_module_non_scriptable.py
I0411 01:28:09.729064 139996747781952 instantiator.py:76] Writing /tmp/tmpbh45v8iz/_remote_module_non_scriptable.py
I0411 01:28:09.729193 139695552149312 instantiator.py:76] Writing /tmp/tmpxnl6vm_i/_remote_module_non_scriptable.py
I0411 01:28:09.753081 140016763078464 instantiator.py:21] Created a temporary directory at /tmp/tmp7q_6dluc
I0411 01:28:09.753335 140016763078464 instantiator.py:76] Writing /tmp/tmp7q_6dluc/_remote_module_non_scriptable.py
I0411 01:28:09.754417 140025252558656 instantiator.py:21] Created a temporary directory at /tmp/tmph9eswg6p
I0411 01:28:09.754645 140025252558656 instantiator.py:76] Writing /tmp/tmph9eswg6p/_remote_module_non_scriptable.py
I0411 01:28:09.915080 139933862451008 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.915109 140190379489088 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.915364 139695552149312 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.915387 139996747781952 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.915420 140198719661888 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.915462 140410258900800 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.926319 140025252558656 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:09.927112 140016763078464 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.024788 140198719661888 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 28, MEMBER: 29/32
I0411 01:28:10.025221 140198719661888 distributed.py:244] Initializing distributed: GLOBAL_RANK: 28, MEMBER: 29/32
I0411 01:28:10.035025 140190379489088 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.035391 139933862451008 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 26, MEMBER: 27/32
I0411 01:28:10.035459 140190379489088 distributed.py:244] Initializing distributed: GLOBAL_RANK: 26, MEMBER: 27/32
INFO: Initializing distributed: GLOBAL_RANK: 27, MEMBER: 28/32
I0411 01:28:10.035774 139933862451008 distributed.py:244] Initializing distributed: GLOBAL_RANK: 27, MEMBER: 28/32
I0411 01:28:10.036301 139996747781952 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 25, MEMBER: 26/32
I0411 01:28:10.036739 139996747781952 distributed.py:244] Initializing distributed: GLOBAL_RANK: 25, MEMBER: 26/32
I0411 01:28:10.040041 139695552149312 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.040362 140410258900800 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 31, MEMBER: 32/32
I0411 01:28:10.040475 139695552149312 distributed.py:244] Initializing distributed: GLOBAL_RANK: 31, MEMBER: 32/32
INFO: Initializing distributed: GLOBAL_RANK: 30, MEMBER: 31/32
I0411 01:28:10.040753 140410258900800 distributed.py:244] Initializing distributed: GLOBAL_RANK: 30, MEMBER: 31/32
I0411 01:28:10.043000 140025252558656 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 24, MEMBER: 25/32
I0411 01:28:10.043380 140025252558656 distributed.py:244] Initializing distributed: GLOBAL_RANK: 24, MEMBER: 25/32
I0411 01:28:10.045079 140016763078464 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 29, MEMBER: 30/32
I0411 01:28:10.045457 140016763078464 distributed.py:244] Initializing distributed: GLOBAL_RANK: 29, MEMBER: 30/32
I0411 01:28:10.055817 139834860619584 instantiator.py:21] Created a temporary directory at /tmp/tmp5z_1mle7
I0411 01:28:10.056031 139834860619584 instantiator.py:76] Writing /tmp/tmp5z_1mle7/_remote_module_non_scriptable.py
I0411 01:28:10.055978 140492815607616 instantiator.py:21] Created a temporary directory at /tmp/tmphx1w2i5w
I0411 01:28:10.056020 140087875577664 instantiator.py:21] Created a temporary directory at /tmp/tmph1axr7bu
I0411 01:28:10.056086 140656298329920 instantiator.py:21] Created a temporary directory at /tmp/tmpbtu1pybz
I0411 01:28:10.056122 140215295838016 instantiator.py:21] Created a temporary directory at /tmp/tmppnwzd61k
I0411 01:28:10.056216 140492815607616 instantiator.py:76] Writing /tmp/tmphx1w2i5w/_remote_module_non_scriptable.py
I0411 01:28:10.056210 139645271131968 instantiator.py:21] Created a temporary directory at /tmp/tmpga3epsek
I0411 01:28:10.056287 140087875577664 instantiator.py:76] Writing /tmp/tmph1axr7bu/_remote_module_non_scriptable.py
I0411 01:28:10.056327 140656298329920 instantiator.py:76] Writing /tmp/tmpbtu1pybz/_remote_module_non_scriptable.py
I0411 01:28:10.056292 140600840021824 instantiator.py:21] Created a temporary directory at /tmp/tmpg6kc0d6y
I0411 01:28:10.056352 140215295838016 instantiator.py:76] Writing /tmp/tmppnwzd61k/_remote_module_non_scriptable.py
I0411 01:28:10.056359 140065187370816 instantiator.py:21] Created a temporary directory at /tmp/tmpdiif_8yy
I0411 01:28:10.056455 139645271131968 instantiator.py:76] Writing /tmp/tmpga3epsek/_remote_module_non_scriptable.py
I0411 01:28:10.056519 140600840021824 instantiator.py:76] Writing /tmp/tmpg6kc0d6y/_remote_module_non_scriptable.py
I0411 01:28:10.056641 140065187370816 instantiator.py:76] Writing /tmp/tmpdiif_8yy/_remote_module_non_scriptable.py
I0411 01:28:10.095409 140254645638976 instantiator.py:21] Created a temporary directory at /tmp/tmphiwovtx8
I0411 01:28:10.095534 140266930190144 instantiator.py:21] Created a temporary directory at /tmp/tmp0pidnazj
I0411 01:28:10.095622 140254645638976 instantiator.py:76] Writing /tmp/tmphiwovtx8/_remote_module_non_scriptable.py
I0411 01:28:10.095559 140681562302272 instantiator.py:21] Created a temporary directory at /tmp/tmpnuhh0vfn
I0411 01:28:10.095789 140266930190144 instantiator.py:76] Writing /tmp/tmp0pidnazj/_remote_module_non_scriptable.py
I0411 01:28:10.095806 140681562302272 instantiator.py:76] Writing /tmp/tmpnuhh0vfn/_remote_module_non_scriptable.py
I0411 01:28:10.095930 140062394136384 instantiator.py:21] Created a temporary directory at /tmp/tmpl4khddlm
I0411 01:28:10.096006 139805815334720 instantiator.py:21] Created a temporary directory at /tmp/tmpdn_0iaqx
I0411 01:28:10.096036 139651106867008 instantiator.py:21] Created a temporary directory at /tmp/tmp1u9fakp1
I0411 01:28:10.096163 140062394136384 instantiator.py:76] Writing /tmp/tmpl4khddlm/_remote_module_non_scriptable.py
I0411 01:28:10.096229 139805815334720 instantiator.py:76] Writing /tmp/tmpdn_0iaqx/_remote_module_non_scriptable.py
I0411 01:28:10.096264 139651106867008 instantiator.py:76] Writing /tmp/tmp1u9fakp1/_remote_module_non_scriptable.py
I0411 01:28:10.116687 140523141875520 instantiator.py:21] Created a temporary directory at /tmp/tmprddb59lz
I0411 01:28:10.116709 139667380000576 instantiator.py:21] Created a temporary directory at /tmp/tmpt_rzto8r
I0411 01:28:10.116808 140313237600064 instantiator.py:21] Created a temporary directory at /tmp/tmptros88by
I0411 01:28:10.116859 139740111906624 instantiator.py:21] Created a temporary directory at /tmp/tmpp44gimpg
I0411 01:28:10.116935 140523141875520 instantiator.py:76] Writing /tmp/tmprddb59lz/_remote_module_non_scriptable.py
I0411 01:28:10.116916 140564078982976 instantiator.py:21] Created a temporary directory at /tmp/tmpb37khjlu
I0411 01:28:10.116986 139667380000576 instantiator.py:76] Writing /tmp/tmpt_rzto8r/_remote_module_non_scriptable.py
I0411 01:28:10.116973 140463569766208 instantiator.py:21] Created a temporary directory at /tmp/tmpk3zlhyvb
I0411 01:28:10.117001 139866054903616 instantiator.py:21] Created a temporary directory at /tmp/tmpdom2z7p4
I0411 01:28:10.117049 140313237600064 instantiator.py:76] Writing /tmp/tmptros88by/_remote_module_non_scriptable.py
I0411 01:28:10.117102 139740111906624 instantiator.py:76] Writing /tmp/tmpp44gimpg/_remote_module_non_scriptable.py
I0411 01:28:10.117094 139849786435392 instantiator.py:21] Created a temporary directory at /tmp/tmpx3j2w3nl
I0411 01:28:10.117154 140564078982976 instantiator.py:76] Writing /tmp/tmpb37khjlu/_remote_module_non_scriptable.py
I0411 01:28:10.117220 140463569766208 instantiator.py:76] Writing /tmp/tmpk3zlhyvb/_remote_module_non_scriptable.py
I0411 01:28:10.117239 139866054903616 instantiator.py:76] Writing /tmp/tmpdom2z7p4/_remote_module_non_scriptable.py
I0411 01:28:10.117322 139849786435392 instantiator.py:76] Writing /tmp/tmpx3j2w3nl/_remote_module_non_scriptable.py
I0411 01:28:10.122470 139934585009984 instantiator.py:21] Created a temporary directory at /tmp/tmp82t5et1g
I0411 01:28:10.122733 139934585009984 instantiator.py:76] Writing /tmp/tmp82t5et1g/_remote_module_non_scriptable.py
I0411 01:28:10.124773 140401730049856 instantiator.py:21] Created a temporary directory at /tmp/tmpu5hyl99r
I0411 01:28:10.125073 140401730049856 instantiator.py:76] Writing /tmp/tmpu5hyl99r/_remote_module_non_scriptable.py
I0411 01:28:10.222768 139834860619584 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.222867 139645271131968 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.223172 140215295838016 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.223875 140600840021824 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.225889 140656298329920 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.229725 140492815607616 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.231614 140065187370816 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.237291 140087875577664 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269181 140254645638976 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269363 139805815334720 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269368 140681562302272 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269441 140266930190144 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269446 140062394136384 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.269479 139651106867008 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.292197 139934585009984 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.292506 140401730049856 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315618 140523141875520 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315655 139740111906624 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315751 140313237600064 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315754 140463569766208 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315768 139866054903616 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315842 140564078982976 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315889 139667380000576 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.315935 139849786435392 github_fabric.py:13] 2: compute-permanent-node-[583,752,801,869]
I0411 01:28:10.335435 139834860619584 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/32
I0411 01:28:10.335878 139834860619584 distributed.py:244] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/32
I0411 01:28:10.339282 139645271131968 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/32
I0411 01:28:10.339728 139645271131968 distributed.py:244] Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/32
I0411 01:28:10.340401 140215295838016 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/32
I0411 01:28:10.340803 140215295838016 distributed.py:244] Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/32
I0411 01:28:10.340913 140600840021824 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/32
I0411 01:28:10.341299 140600840021824 distributed.py:244] Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/32
I0411 01:28:10.345465 140656298329920 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/32
I0411 01:28:10.345892 140656298329920 distributed.py:244] Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/32
I0411 01:28:10.347674 140492815607616 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/32
I0411 01:28:10.348047 140492815607616 distributed.py:244] Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/32
I0411 01:28:10.348870 140065187370816 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/32
I0411 01:28:10.349262 140065187370816 distributed.py:244] Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/32
I0411 01:28:10.350788 140087875577664 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/32
I0411 01:28:10.351154 140087875577664 distributed.py:244] Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/32
I0411 01:28:10.378894 140254645638976 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/32
I0411 01:28:10.379357 140254645638976 distributed.py:244] Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/32
I0411 01:28:10.389169 140266930190144 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/32
I0411 01:28:10.389594 140266930190144 distributed.py:244] Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/32
I0411 01:28:10.390151 140062394136384 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/32
I0411 01:28:10.390582 140062394136384 distributed.py:244] Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/32
I0411 01:28:10.392222 139805815334720 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/32
I0411 01:28:10.392601 139805815334720 distributed.py:244] Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/32
I0411 01:28:10.393430 139651106867008 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/32
I0411 01:28:10.393808 139651106867008 distributed.py:244] Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/32
I0411 01:28:10.394481 140681562302272 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/32
I0411 01:28:10.394875 140681562302272 distributed.py:244] Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/32
I0411 01:28:10.399302 139934585009984 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/32
I0411 01:28:10.399709 139934585009984 distributed.py:244] Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/32
I0411 01:28:10.401868 140401730049856 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/32
I0411 01:28:10.402272 140401730049856 distributed.py:244] Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/32
I0411 01:28:10.437248 140523141875520 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/32
I0411 01:28:10.437697 140523141875520 distributed.py:244] Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/32
I0411 01:28:10.440183 140564078982976 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/32
I0411 01:28:10.440588 140564078982976 distributed.py:244] Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/32
I0411 01:28:10.442291 139740111906624 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/32
I0411 01:28:10.442701 139740111906624 distributed.py:244] Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/32
I0411 01:28:10.443188 140463569766208 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/32
I0411 01:28:10.443591 140463569766208 distributed.py:244] Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/32
I0411 01:28:10.443732 140313237600064 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/32
I0411 01:28:10.444110 140313237600064 distributed.py:244] Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/32
I0411 01:28:10.444300 139866054903616 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/32
I0411 01:28:10.444680 139866054903616 distributed.py:244] Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/32
I0411 01:28:10.446431 139849786435392 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/32
I0411 01:28:10.446923 139849786435392 distributed.py:244] Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/32
I0411 01:28:10.446914 139667380000576 github_fabric.py:15] 3: compute-permanent-node-[583,752,801,869]
INFO: Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/32
I0411 01:28:10.447352 139667380000576 distributed.py:244] Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/32

From the above output:

SLURM_NODELIST in sbatch
compute-permanent-node-801,compute-permanent-node-869,compute-permanent-node-583,compute-permanent-node-752

and output at (1) in python file

I0411 01:28:05.820399 139934585009984 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]

They are not the same.

awaelchli commented 1 year ago

What does the code snippet that @carmocca posted print? Again, Lightning does not modify the SLURM_NODELIST.

shethdhvani commented 1 year ago

As mentioned above,

SLURM_NODELIST in sbatch
compute-permanent-node-801,compute-permanent-node-869,compute-permanent-node-583,compute-permanent-node-752

and then the output at (1) in the Python file, as @carmocca asked:

I0411 01:28:05.820399 139934585009984 github_fabric.py:11] 1: compute-permanent-node-[583,752,801,869]

Both are different. If Lightning does not modify SLURM_NODELIST, is this an issue with SLURM then?

awaelchli commented 1 year ago

Likely, or it's a problem with how you configure the parameters in the sbatch script (I don't know; your config looks fine to me). You can replace the Lightning script with a regular script and you will see the same SLURM_NODELIST printed. You could also try printing SLURM_JOB_NODELIST, which according to the docs should be the same as SLURM_NODELIST.

Maybe if SLURM_HOSTFILE is provided, then SLURM_NODELIST is not important. Perhaps the implementation here could be changed to read the hostfile and take the main address from the first host in the file: https://github.com/Lightning-AI/lightning/blob/1aa23267abd161fb02fe5c2775c34d006d1e336e/src/lightning/fabric/plugins/environments/slurm.py#L55-L60
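
For illustration, a rough, hypothetical sketch of that idea (the function name and fallback here are assumptions, not the actual Lightning implementation):

import os

def main_address() -> str:
    # Hypothetical: when SLURM_HOSTFILE is set (arbitrary distribution), the first
    # host in the file is where global rank 0 is placed, so use it as the main address.
    hostfile = os.environ.get("SLURM_HOSTFILE")
    if hostfile and os.path.isfile(hostfile):
        with open(hostfile) as f:
            for line in f:
                host = line.strip()
                if host:
                    return host
    # Fall back to the first host of SLURM_NODELIST (current behavior); a real
    # implementation would still need to expand compressed "[...]"-style lists.
    return os.environ.get("SLURM_NODELIST", "127.0.0.1").split(",")[0]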

shethdhvani commented 1 year ago

Below is my sbatch file.

#!/bin/bash
#SBATCH --job-name=distributed_run
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
export PMI_DEBUG=1

source <path to mpivars.sh>

export SLURM_HOSTFILE=$(mktemp)
python3 /home/ubuntu/node_ordering.py --host-order-file <file with nodes ordered> | tee $SLURM_HOSTFILE

cat $SLURM_HOSTFILE

srun -v --mpi=pmi2 --gpus-per-node=$SBATCH_GPUS_PER_NODE \
     --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
     --distribution=arbitrary \
     python3 /home/ubuntu/test.py

Based on the srun documentation here: https://slurm.schedmd.com/srun.html#OPT_arbitrary

The arbitrary method of distribution will allocate processes in-order as listed in file designated by the environment variable SLURM_HOSTFILE. If this variable is listed it will over ride any other method specified. If not set the method will default to block. Inside the hostfile must contain at minimum the number of hosts requested and be one per line or comma separated. If specifying a task count (-n, --ntasks=), your tasks will be laid out on the nodes in the order of the file.

We have been using SLURM_HOSTFILE and --distribution=arbitrary to order the nodes based on the ordering in SLURM_HOSTFILE. So can we update the logic to pick the root address/master node from SLURM_HOSTFILE when it is specified and --distribution=arbitrary is used?