Closed casparvl closed 2 years ago
Hi @casparvl This is the implementation of _is_slurm_managing_tasks:
def _is_slurm_managing_tasks(self) -> bool:
    """used by choosing cluster enviroment."""
    if not SLURMEnvironment.detect() or SLURMEnvironment.job_name() == "bash":
        return False
    total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
    num_slurm_tasks = int(os.environ["SLURM_NTASKS"], 0)
    return num_slurm_tasks == total_requested_devices
And yes, it needs to return True for your job to be launched correctly. Can you tell me which of these conditions returns the wrong value? Please print the following values:
isinstance(trainer.strategy.cluster_environment, SLURMEnvironment) # True
SLURMEnvironment.detect() # True
SLURMEnvironment.job_name() # should not be "bash"
os.environ["SLURM_NTASKS"] # 4
len(trainer.accelerator_connector._parallel_devices) # 4
Also, set
trainer = Trainer(
    ...,
    accelerator='gpu',
    devices=4,
    strategy='ddp',
)
Maybe the gpus=-1 is the problem.
Hey @awaelchli ,
Sorry, I must not have explained my issue clearly. I actually know what is going wrong. In the _is_slurm_managing_tasks(self) function, the issue is that self._parallel_devices has length 2, while os.environ["SLURM_NTASKS"] is 4. This then causes total_requested_devices to be 2 and num_slurm_tasks to be 4, which is why False is returned.
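To make the arithmetic concrete, here is a minimal, Lightning-free sketch of the same comparison. The function name `slurm_manages_tasks` and its parameters are hypothetical stand-ins for the values used inside `_is_slurm_managing_tasks`. It is not part of the Lightning API, and it assumes a single-node allocation as in this issue.

```python
def slurm_manages_tasks(num_visible_devices: int, num_nodes: int, slurm_ntasks: str) -> bool:
    """Hypothetical stand-in for the final comparison in _is_slurm_managing_tasks."""
    total_requested_devices = num_visible_devices * num_nodes
    # Base 0 lets int() infer the base from the string, as in the quoted code.
    num_slurm_tasks = int(slurm_ntasks, 0)
    return num_slurm_tasks == total_requested_devices


# With GPU binding, each process sees 2 GPUs while SLURM launched 4 tasks on 1 node.
print(slurm_manages_tasks(num_visible_devices=2, num_nodes=1, slurm_ntasks="4"))  # False

# Without binding, each process sees all 4 GPUs, so the counts agree.
print(slurm_manages_tasks(num_visible_devices=4, num_nodes=1, slurm_ntasks="4"))  # True
```

The first call corresponds to the failing case: Lightning concludes that SLURM is not managing the tasks and starts spawning processes itself.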
What I tried to trace back in my explanation above is the reason why self._parallel_devices has length 2. It is set based on self.accelerator.get_parallel_devices(self._devices_flag). self._devices_flag in turn is based on self.accelerator.auto_device_count(), which finally calls torch.cuda.device_count(). This returns 2 in my SLURM allocation because each process is bound to a subset of 2 GPUs (out of the four available on the node). That is completely valid behavior for SLURM: for performance, it is very desirable to make sure your GPUs are driven by the CPUs that are closest to them from a connectivity perspective, which is also why SLURM has this option in the first place. But PyTorch Lightning clearly assumes that each process can 'see' each GPU - that assumption is broken in these allocations.
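The effect of the binding on the per-process count can be illustrated without a GPU. The sketch below only parses a `CUDA_VISIBLE_DEVICES`-style string. The helper `visible_device_ids` is hypothetical, and it assumes the variable holds a comma-separated list of integer indices (real values can also be GPU UUIDs, which this sketch does not handle).

```python
from typing import Dict, List


def visible_device_ids(cuda_visible_devices: str) -> List[int]:
    """Parse a comma-separated list of integer GPU indices (hypothetical helper)."""
    return [int(token) for token in cuda_visible_devices.split(",") if token.strip()]


# Per-task values as reported for the allocation in this issue (hard-coded here).
per_task_value: Dict[int, str] = {0: "0,1", 1: "0,1", 2: "2,3", 3: "2,3"}

for task, value in per_task_value.items():
    ids = visible_device_ids(value)
    print(f"task {task}: sees GPUs {ids}, per-process device count = {len(ids)}")
```

Every task reports a count of 2, even though the node has 4 GPUs, which is the mismatch described above.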
Note that I put gpus=-1 because I want to autodetect the number of GPUs, but even when specifying gpus=4 I run into similar issues that arise from the fact that not all processes see all GPUs. In that case I get:
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1, 2, 3]
But your machine only has: [0, 1]
The problem is very clear to me. What I'm struggling with, is what would be an appropriate solution... It would require two parts:
A more fool-proof way of counting the total number of devices is needed, so that total_requested_devices would be 4 in my case, even when the world consists of 4 processes that each see only 2 of the node's 4 GPUs. That will fix the issue of PyTorch Lightning trying to spawn more processes itself. One suggested option would be: allgather CUDA_VISIBLE_DEVICES from each process on a node, then take the unique items, and count the length of that list. Take an example where rank 0 and 1 each have access to GPU 0 and 1, and rank 2 and 3 have access to GPU 2 and 3. The allgather would result in [0, 1, 0, 1, 2, 3, 2, 3], taking the unique elements of the list would then result in [0, 1, 2, 3], of which the length is 4 - which is the number of devices per node. Note that this would also work in case all processes see all GPUs, as the allgather would then result in [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3] => unique list [0, 1, 2, 3] => length is 4. And it would also work in case each process only sees a single GPU, in which case the allgather results in [0, 1, 2, 3] => unique list [0, 1, 2, 3] => length is 4.
PyTorch Lightning will assign which rank should use which GPU IDs (I'm actually not entirely sure where this happens, but maybe you can tell me: how does each rank decide which GPU it can use?) in a way that does not respect the GPU binding. My naive guess is that somewhere you set a mapping so that rank 0 can use device 0, rank 1 can use device 1, rank 2 can use device 2, etc. This runs into problems if e.g. rank 0 and rank 1 are closest to GPUs [2,3], and rank 2 and 3 are closest to GPUs [0,1]. Figuring out the correct GPU ID to assign to each rank is quite tricky in this case. Somehow, the code would have to figure out that e.g.
Rank | GPU ID |
---|---|
0 | 2 |
1 | 3 |
2 | 0 |
3 | 1 |
is a valid mapping, but e.g.
Rank | GPU ID |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
isn't.
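Both parts can be sketched in plain Python. The snippet below is only an illustration of the idea, not Lightning code: `count_node_devices` and `is_valid_mapping` are hypothetical names, and the per-rank lists are hard-coded. In a real run they would have to come from a collective restricted to the ranks of one node (for example `torch.distributed.all_gather_object`), which is assumed here rather than shown.

```python
from typing import Dict, Sequence


def count_node_devices(gathered: Sequence[Sequence[int]]) -> int:
    """Count the unique GPU IDs across the per-rank visible-device lists of one node."""
    unique_ids = {gpu for visible in gathered for gpu in visible}
    return len(unique_ids)


def is_valid_mapping(mapping: Dict[int, int], visible: Dict[int, Sequence[int]]) -> bool:
    """Valid if every rank gets a GPU it can see and no GPU is used by two ranks."""
    each_rank_sees_its_gpu = all(gpu in visible[rank] for rank, gpu in mapping.items())
    no_gpu_shared = len(set(mapping.values())) == len(mapping)
    return each_rank_sees_its_gpu and no_gpu_shared


# Part 1: the three visibility patterns from the text all give 4 devices per node.
print(count_node_devices([[0, 1], [0, 1], [2, 3], [2, 3]]))  # 4
print(count_node_devices([[0, 1, 2, 3]] * 4))                # 4
print(count_node_devices([[0], [1], [2], [3]]))              # 4

# Part 2: ranks 0 and 1 are closest to GPUs [2, 3], ranks 2 and 3 to GPUs [0, 1].
visible = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}
print(is_valid_mapping({0: 2, 1: 3, 2: 0, 3: 1}, visible))  # True
print(is_valid_mapping({0: 0, 1: 1, 2: 2, 3: 3}, visible))  # False
```

The validity check only verifies a candidate mapping. Producing one automatically is the harder part discussed above.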
All in all, it's far from trivial... I don't really have time to dive into this myself and fix it - also because I'm insufficiently familiar with the Lightning codebase. But I'm very willing to give the input needed (& do testing) if that helps to get it fixed.
For now I'll accept it as a requirement that Lightning only works properly if each process can see each GPU in a local node, but since that means we can't use GPU-binding, that might come at the cost of a (small) performance penalty.
Should we just get rid of _is_slurm_managing_tasks and always assume SLURM is managing tasks when SLURM is detected? This is a hidden behavior that stuck around since the very beginning of PL multi-node support. I'd like to rethink this in the context of https://github.com/Lightning-AI/lightning/issues/14078
These three lines here make the user experience extremely error-prone: https://github.com/Lightning-AI/lightning/blob/48c23e571637438726662104325c05ba768288be/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L579-L581
Any opinions from SLURM users here?
🐛 Bug && steps to reproduce
I'm running the standard boring_model (or any model), with some minor changes to the arguments:
on our SLURM cluster like so:
However, it seems to spawn 8 processes: 4 'worlds' of size two. While running, each GPU is running two processes. I also don't see "Multiprocessing is handled by SLURM" in the output, which is expected from here.
Expected behavior
I expect srun to launch 4 tasks, and PyTorch Lightning's SLURMEnvironment to make these into a single world of size 4.
Environment
Additional context
The issue first popped up in real-world code where I was combining PyTorch Lightning with Hydra, and showed up as a rather strange error message from Hydra. This thread contains a lot more analysis on that. However, I'll repeat/summarize the essentials here.
The problem arises because _is_slurm_managing_tasks(self) in accelerator_connector.py is returning False, despite the tasks being launched by SLURM. The result is that PyTorch Lightning will itself try to spawn tasks from each of the four processes already launched by srun. As we'll see later, each process sees 2 GPUs and thus launches two tasks, so I end up with my final total of 8 tasks.
Since PyTorch Lightning thinks it's running outside of a SLURM context, that also causes the Hydra error. This is because Lightning will use subprocess_script.py to try to launch a new process. Here it will add hydra.run.dir to the original argument list and call the original command again, but since the original command is the _submit.py script (and not the PyTorch Lightning training script), this fails with an unrecognized argument error.
That part aside, the real problem is: why doesn't Lightning recognize that SLURM is managing the tasks? Well, that's because of this line. Lightning assumes two things here:
self._parallel_devices contains the number of physical parallel devices that are available per node
I won't go deeper into number (2), since that assumption isn't broken here - I just want to remark that the _is_slurm_managing_tasks() logic would also break in that case. In our case, the problem is (1).
Let's see what's going on step by step.
self._parallel_devices is set here, based on a call to self.accelerator.get_parallel_devices(self._devices_flag). In turn, self._devices_flag is set here in my case, based on a call to self.accelerator.auto_device_count(). Since my accelerator is a GPU, self.accelerator is a GPUAccelerator object. Thus, it's calling this function. And this, surprisingly, is where the problem lies. auto_device_count will return the number of devices that that process has access to. That is potentially not the same as the number of physical parallel devices available per node in a SLURM allocation.
To demonstrate, consider the following run:
As you can see, SLURM launches 4 tasks. Each of those tasks gets access to a subset of the devices available on the node. In this case, auto_device_count will return 2 - the number of GPUs that that particular process has access to. The reason it only has access to two is the --gpu-bind=closest argument (see the documentation of srun). This makes sure that each task is bound to the GPUs that are closest (in a NUMA sense) to the CPU cores it runs on. To demonstrate:
As you can see, each task gets bound to 18 cores. This system has 36 cores per socket, and 2 sockets in total. SLURM binds task 0 to CPUs 0-17 on Socket 0 and sets CUDA_VISIBLE_DEVICES=0,1 for that task, because those two GPUs are attached to the PCI bridge of that socket. Similarly, it binds task 1 to CPUs 18-35 on Socket 0 and also sets CUDA_VISIBLE_DEVICES=0,1 (because those CPUs are still on Socket 0, so these are attached to that PCI bridge). Then, it binds task 2 to CPUs 36-53 on Socket 1 and sets CUDA_VISIBLE_DEVICES=2,3, and it binds task 3 to CPUs 54-71 on Socket 1 and sets CUDA_VISIBLE_DEVICES=2,3. Please note that all of this is completely intended and correct SLURM behaviour. Thus, ideally, the SLURMEnvironment should be able to deal with it - yet it isn't.
I have tried to just hard-code the output of _is_slurm_managing_tasks to True, just to see if that would fix it. However, this simply produces more errors: the SLURMEnvironment is now being used, but it results in
probably because self._devices_flag is set to [0,1] for each task, even though two of the tasks only have access to devices [2,3].
I'm not really sure how to 'fix' this. It would require changes both in the _is_slurm_managing_tasks function and in the way that SLURMEnvironment assigns devices to each of the processes...
cc @awaelchli @akihironitta
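As a rough illustration of what a binding-aware device assignment could look like, here is a minimal sketch. It is not how Lightning or SLURMEnvironment works: `assign_devices` is a hypothetical helper, the visibility table is hard-coded from the allocation described above, and it assumes that the visible sets of any two ranks are either identical or disjoint.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def assign_devices(visible: Dict[int, Sequence[int]]) -> Dict[int, int]:
    """Give each rank a distinct physical GPU ID taken from its own visible set.

    Assumes ranks either share exactly the same visible set or have disjoint sets.
    """
    # Group ranks by the set of GPUs they can see.
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for rank in sorted(visible):
        groups[tuple(visible[rank])].append(rank)

    assignment: Dict[int, int] = {}
    for gpus, ranks in groups.items():
        if len(ranks) > len(gpus):
            raise ValueError(f"ranks {ranks} share only {len(gpus)} GPUs: {list(gpus)}")
        # Ranks sharing a visible set take its GPUs in order, one GPU per rank.
        for position, rank in enumerate(ranks):
            assignment[rank] = gpus[position]
    return assignment


# Visibility as set by --gpu-bind=closest in the allocation described above.
visible = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}
print(assign_devices(visible))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

Unlike a fixed [0,1] device list for every task, this only ever hands a rank a GPU from its own visible set. How such a table would be gathered across processes is left open here.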