Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

`_is_slurm_managing_tasks(self)` incorrectly returns `False` when using SLURM's `--gpu-bind=closest` #13605

Closed: casparvl closed this issue 2 years ago

casparvl commented 2 years ago

🐛 Bug and steps to reproduce

I'm running the standard boring_model (or any model), with some minor changes to the arguments:

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1000,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=100,
        enable_model_summary=False,
        accelerator='gpu',
        gpus=-1,
        strategy='ddp',
    )

on our SLURM cluster like so:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest python boring_model.py --accelerator 'gpu' --devices -1 --strategy ddp

However, this seems to spawn 8 processes: 4 'worlds' of size 2. While running, each GPU hosts two processes. I also don't see "Multiprocessing is handled by SLURM" in the output, which I would expect based on here.

Expected behavior

I expect srun to launch 4 tasks, and PyTorch Lightning's SLURMEnvironment to turn these into a single world of size 4.

Environment

* CUDA:
        - GPU:
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
        - available:         True
        - version:           11.5
* Packages:
        - numpy:             1.20.3
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0+cu115
        - pytorch-lightning: 1.6.3
        - tqdm:              4.64.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.5
        - version:           #1 SMP Wed Apr 6 13:48:37 EDT 2022

Additional context

The issue first popped up in real-world code where I was combining PT Lightning with Hydra, and showed up as a rather strange error message from Hydra. This thread contains a lot more analysis of that. However, I'll repeat/summarize the essentials here.

The problem arises because _is_slurm_managing_tasks(self) in accelerator_connector.py returns False, despite the tasks being launched by SLURM. The result is that PyTorch Lightning itself tries to spawn tasks from each of the four processes already launched by srun. As we'll see later, each process sees 2 GPUs and thus launches two tasks, so I end up with my final total of 8 tasks.

Since PyTorch Lightning thinks it's running outside of a SLURM context, that also causes the Hydra error. This is because Lightning will use subprocess_script.py to try and launch a new process. There it adds hydra.run.dir to the original argument list and calls the original command again, but since the original command is the _submit.py script (and not the PyTorch Lightning training script), this fails with an unrecognized-argument error.

That part aside, the real problem is: why doesn't Lightning recognize that SLURM is managing the tasks? Well, that's because of this line. Lightning assumes two things here:

  1. That the length of self._parallel_devices equals the number of physical parallel devices available per node.
  2. That each node has the same number of parallel devices.

I won't go deeper into number (2), since that assumption isn't broken here - I just want to remark that the _is_slurm_managing_tasks() logic would also break in that case. In our case, the problem is (1).

Let's see what's going on step by step. self._parallel_devices is set here, based on a call to self.accelerator.get_parallel_devices(self._devices_flag). In turn, self._devices_flag is set here in my case, based on a call to self.accelerator.auto_device_count(). Since my accelerator is a GPU, self.accelerator is a GPUAccelerator object, so it's calling this function. And this, surprisingly, is where the problem lies: auto_device_count returns the number of devices that that particular process has access to, which is potentially not the same as the number of physical parallel devices available per node in a SLURM allocation.

To demonstrate, consider the following run:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest env | grep CUDA_VISIBLE_DEVICES
srun: job 1225150 queued and waiting for resources
srun: job 1225150 has been allocated resources
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1

As you can see, SLURM launches 4 tasks. Each of those tasks gets access to a subset of the devices available on the node. In this case, auto_device_count will return 2 - the number of GPUs that that particular process has access to. The reason it only has access to two is the --gpu-bind=closest argument (see the documentation of srun). This binds each task to the GPUs that are closest (in a NUMA sense) to the CPUs running that task. To demonstrate:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest numactl --show
srun: job 1225362 queued and waiting for resources
srun: job 1225362 has been allocated resources
policy: default
preferred node: current
physcpubind: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
cpubind: 0
nodebind: 0
membind: 0 1
policy: default
preferred node: current
physcpubind: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0 1

As you can see, each task gets bound to 18 cores. This system has 36 cores per socket and 2 sockets in total. SLURM binds task 0 to CPUs 0-17 on socket 0 and sets CUDA_VISIBLE_DEVICES=0,1 for that task, because those two GPUs are attached to the PCI bridge of that socket. Similarly, it binds task 1 to CPUs 18-35 on socket 0 and also sets CUDA_VISIBLE_DEVICES=0,1 (those CPUs are still on socket 0, so they hang off the same PCI bridge). It then binds task 2 to CPUs 36-53 on socket 1 and sets CUDA_VISIBLE_DEVICES=2,3, and binds task 3 to CPUs 54-71 on socket 1 and sets CUDA_VISIBLE_DEVICES=2,3. Please note that all of this is completely intended and correct SLURM behaviour. Thus, ideally, the SLURMEnvironment should be able to deal with it - yet it doesn't.
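To tie this back to Lightning's device counting, a tiny probe script launched with the same srun line shows what auto_device_count ends up seeing in each task (this is just a diagnostic sketch, not Lightning code; the script name is arbitrary):

    # probe.py (name is arbitrary) - a diagnostic sketch, not Lightning code.
    # Launch it with the same srun line as above, e.g.:
    #   srun ... --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest python probe.py
    import os

    import torch

    rank = os.environ.get("SLURM_PROCID", "?")            # global task id set by SLURM
    ntasks = os.environ.get("SLURM_NTASKS", "?")          # 4 in this allocation
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")  # the subset this task is bound to

    # auto_device_count() for the GPU accelerator boils down to torch.cuda.device_count(),
    # which only counts the devices *this* task can see (2 here), not the 4 on the node.
    print(f"rank={rank} SLURM_NTASKS={ntasks} "
          f"CUDA_VISIBLE_DEVICES={visible} device_count={torch.cuda.device_count()}")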

I have tried to simply hard-code the output of _is_slurm_managing_tasks to True, just to see if that would fix it. However, this only produces a different error: the SLURMEnvironment is now being used, but it results in

Traceback (most recent call last):
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 64, in <module>
    run()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 61, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 471, in teardown
    if self.root_device.type == "cuda":
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 117, in root_device
    return self.parallel_devices[self.local_rank]
IndexError: list index out of range

probably because self._devices_flag is set to [0, 1] for each task, even though two of the tasks only have access to devices [2, 3]; parallel_devices then only has two entries while the local rank on the node still runs up to 3, so the lookup above fails.
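A stripped-down illustration of that failing lookup (just the indexing in isolation, not Lightning's actual code): parallel_devices only holds the two devices a task can see, while the local ranks on the node still run from 0 to 3:

    import torch

    # Roughly what DDPStrategy.root_device does: parallel_devices[local_rank].
    # With --gpu-bind=closest each task only sees two devices ...
    parallel_devices = [torch.device("cuda", 0), torch.device("cuda", 1)]

    for local_rank in range(4):  # ... but the local ranks on the node still go 0..3
        try:
            print(local_rank, parallel_devices[local_rank])
        except IndexError as err:
            print(local_rank, f"IndexError: {err}")  # ranks 2 and 3 end up here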

I'm not really sure how to 'fix' this: it would require changes both to the _is_slurm_managing_tasks function and to the way the SLURMEnvironment assigns devices to each of the processes...

cc @awaelchli @akihironitta

awaelchli commented 2 years ago

Hi @casparvl This is the implementation of _is_slurm_managing_tasks:

    def _is_slurm_managing_tasks(self) -> bool:
        """used by choosing cluster enviroment."""
        if not SLURMEnvironment.detect() or SLURMEnvironment.job_name() == "bash":
            return False

        total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
        num_slurm_tasks = int(os.environ["SLURM_NTASKS"], 0)
        return num_slurm_tasks == total_requested_devices

And yes, it needs to return True for your job to be launched correctly. Can you tell me which of these conditions returns the wrong value? Please print the following values:

    isinstance(trainer.strategy.cluster_environment, SLURMEnvironment)  # True
    SLURMEnvironment.detect()  # True
    SLURMEnvironment.job_name()  # should not be "bash"
    os.environ["SLURM_NTASKS"]  # 4
    len(trainer.accelerator_connector._parallel_devices)  # 4

Also, set

    trainer = Trainer(
        ...,
        accelerator='gpu',
        devices=4,
        strategy='ddp',
    )

Maybe the gpus=-1 is the problem.

casparvl commented 2 years ago

Hey @awaelchli ,

Sorry, I must not have explained my issue clearly - I actually know what is going wrong. In the _is_slurm_managing_tasks(self) function, the issue is that len(self._parallel_devices) is 2, while os.environ["SLURM_NTASKS"] is 4. This causes total_requested_devices to be 2 and num_slurm_tasks to be 4, which is why False is returned.
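In other words, hand-evaluating the check above with the values from this job (a paraphrase for illustration, not the real code path):

    # values as observed in this allocation
    parallel_devices = ["cuda:0", "cuda:1"]  # each task only sees 2 of the 4 GPUs
    num_nodes_flag = 1
    num_slurm_tasks = 4                      # int(os.environ["SLURM_NTASKS"])

    total_requested_devices = len(parallel_devices) * num_nodes_flag  # 2
    print(num_slurm_tasks == total_requested_devices)                 # False -> "SLURM not managing tasks"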

What I tried to trace back in my explanation above is the reason why self._parallel_devices only contains 2 devices. It is set based on self.accelerator.get_parallel_devices(self._devices_flag). self._devices_flag in turn is based on self.accelerator.auto_device_count(), which finally calls torch.cuda.device_count(). This returns 2 in my SLURM allocation because each process is bound to a subset of 2 GPUs (out of the four available on the node). That is completely valid SLURM behavior: for performance, it is very desirable to make sure your GPUs are driven by the CPUs that are closest to them from a connectivity perspective, which is also why SLURM has this option in the first place. But PyTorch Lightning clearly assumes that each process can 'see' every GPU - that assumption is broken in these allocations.

Note that I put gpus=-1 because I want to autodetect the number of GPUs, but even when specifying gpus=4 I run into similar issues arising from the fact that not all processes see all GPUs. In that case I get:

pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1, 2, 3]
 But your machine only has: [0, 1]

The problem is very clear to me. What I'm struggling with is what an appropriate solution would look like... It would require two parts:

  1. A more fool-proof way of counting the total number of devices per node is needed, so that total_requested_devices would return 4 in my case, even if the world consists of 4 processes that each only have access to 2 of the 4 GPUs. That would fix the issue of PT Lightning trying to spawn more processes itself. One suggested option here would be: allgather the CUDA_VISIBLE_DEVICES from each rank on a node, take the unique items, and count the length of that list. Take an example where ranks 0 and 1 each have access to GPUs 0 and 1, and ranks 2 and 3 have access to GPUs 2 and 3. The allgather would result in [0, 1, 0, 1, 2, 3, 2, 3]; taking the unique elements then gives [0, 1, 2, 3], whose length is 4 - the number of devices per node. Note that this would also work in case all processes see all GPUs, as the allgather would then result in [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] => unique list [0, 1, 2, 3] => length 4. And it would also work in case each process only sees a single GPU, in which case the allgather results in [0, 1, 2, 3] => unique list [0, 1, 2, 3] => length 4. (A rough sketch of this counting step follows after the mapping example below.)

  2. PyTorch Lightning assigns which GPU ID each rank should use (I'm actually not entirely sure where this happens, but maybe you can tell me: how does each rank decide which GPU it can use?) in a way that does not respect the GPU binding. My naive guess is that somewhere a mapping is set so that rank 0 uses device 0, rank 1 uses device 1, rank 2 uses device 2, etc. This runs into problems if e.g. ranks 0 and 1 are closest to GPUs [2, 3], and ranks 2 and 3 are closest to GPUs [0, 1]. Figuring out the correct GPU ID to assign to each rank is quite tricky in this case. Somehow, the code would have to figure out that e.g.

| Rank | GPU ID |
| ---- | ------ |
| 0    | 2      |
| 1    | 3      |
| 2    | 0      |
| 3    | 1      |

is a valid mapping, but e.g.

| Rank | GPU ID |
| ---- | ------ |
| 0    | 0      |
| 1    | 1      |
| 2    | 2      |
| 3    | 3      |

isn't.
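To make point (1) above concrete, here is a rough sketch of the counting step, assuming the per-task CUDA_VISIBLE_DEVICES strings have already been gathered per node somehow (the function is purely illustrative, not a proposed patch):

    from typing import List


    def unique_device_count(visible_devices_per_task: List[str]) -> int:
        """Count distinct physical GPUs on a node, given the CUDA_VISIBLE_DEVICES
        string of every task on that node (gathering them is the hard part)."""
        unique_ids = set()
        for visible in visible_devices_per_task:
            unique_ids.update(d.strip() for d in visible.split(",") if d.strip())
        return len(unique_ids)


    # --gpu-bind=closest: 4 tasks, each bound to 2 of the 4 GPUs
    assert unique_device_count(["0,1", "0,1", "2,3", "2,3"]) == 4
    # no binding: every task sees all 4 GPUs
    assert unique_device_count(["0,1,2,3"] * 4) == 4
    # one GPU per task
    assert unique_device_count(["0", "1", "2", "3"]) == 4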

All in all, it's far from trivial... I don't really have time to dive into this myself and fix it, also because I'm insufficiently familiar with the Lightning codebase. But I'm very willing to provide the input needed (and do testing) if that helps get it fixed.

For now I'll accept it as a requirement that Lightning only works properly if each process can see every GPU on the local node, but since that means we can't use GPU binding, it might come at the cost of a (small) performance penalty.

awaelchli commented 2 years ago

Should we just get rid of _is_slurm_managing_tasks and always assume SLURM is managing tasks when SLURM is detected? This is a hidden behavior that has stuck around since the very beginning of PL multi-node support. I'd like to rethink this in the context of https://github.com/Lightning-AI/lightning/issues/14078

These three lines here make the user experience extremely error-prone: https://github.com/Lightning-AI/lightning/blob/48c23e571637438726662104325c05ba768288be/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L579-L581

Any opinions from SLURM users here?