Which file contains the code you are running? Is it hyperparameter_tune_lt_lstm.py? Can you check the permissions on that file? Make sure you don't have any code in it that changes the current working directory, otherwise the Trainer can't access and launch that script for the other processes.
What happens if you run our debug script with strategy="ddp" and devices=2?
https://github.com/Lightning-AI/lightning/blob/master/examples/pytorch/bug_report/bug_report_model.py
Yes, hyperparameter_tune_lt_lstm.py is the main script that invokes the training. For some more insight: it leverages a Luigi pipeline to train LSTMs and log them to our MLflow instance. Ultimately, it instantiates successive TrainLSTM tasks (the classes whose code creates the Trainer and runs trainer.fit()). I can post more of that code if you think it'd be useful.
As an earlier test (suspecting the same thing), I chmod 777'd the script just to ensure it would be accessible to any other processes. I also attempted running it with sudo, but was met with the same No such file or directory error. There's also nowhere in the code where I change directories. Do you happen to know of any specific differences between ddp_spawn and ddp that would allow the script to run as expected with the former method?
I just ran bug_report_model.py. For whatever reason, I had to change the from lightning.pytorch import LightningModule, Trainer statement to from pytorch_lightning import LightningModule, Trainer (since my environment apparently doesn't know Lightning by the former name). But with strategy='ddp' and devices=2 (and the same Python environment), it runs without issue.
Missed the forest for the trees: there is an error in the path the spawned processes try to open the script from. It has one too many subdirectories of the same name: there should only be two identically named subdirectories on the path to the script, but the path the spawned processes are looking in has three.
So the question becomes: where do the subprocesses get their information about where the script is located? And why do they get it right for ddp_spawn and not ddp?
Thanks for your continued help!
It is very simple. The regular ddp launcher creates a new process with subprocess.Popen, and the command it runs is constructed from the script that started the program: the path is basically just os.path.abspath(sys.argv[0]), i.e. the name of the script, which relies on the current working directory still being what it was when the program started.
Since you are using a third-party framework that creates these tasks, I suspect it is changing the working directory, so by the time the Trainer launches, it can't find the file in the current directory.
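Schematically, it does something like the following (a simplified sketch, not the actual Lightning source; the environment variable names are illustrative):

```python
# Simplified sketch of what a subprocess-based launcher does for each extra rank:
# it re-runs the same script it believes it was started from, resolved against
# the *current* working directory.
import os
import subprocess
import sys


def launch_extra_rank(local_rank: int, world_size: int) -> subprocess.Popen:
    # sys.argv[0] is often just a relative path like "hyperparameter_tune_lt_lstm.py",
    # so abspath() prepends whatever os.getcwd() happens to be at launch time.
    script = os.path.abspath(sys.argv[0])
    command = [sys.executable, script] + sys.argv[1:]

    env = os.environ.copy()
    env["LOCAL_RANK"] = str(local_rank)   # illustrative; real variable names may differ
    env["WORLD_SIZE"] = str(world_size)

    # If the working directory has changed since startup (e.g. inside a task runner),
    # `script` now points at a non-existent path -> "No such file or directory".
    return subprocess.Popen(command, env=env)
```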
And why do they get it correct for ddp_spawn and not ddp? Very simple: ddp_spawn uses multiprocessing.spawn from torch/Python to create the new processes, so it makes sense that it doesn't fail this way.
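For contrast, a minimal sketch of the spawn approach (placeholder worker function, not Lightning's actual internals):

```python
# ddp_spawn-style launching: the workers run a *function* from the current program
# via torch.multiprocessing.spawn, so nothing is re-resolved from sys.argv[0]
# against the current working directory.
import torch.multiprocessing as mp


def _train_worker(rank: int, world_size: int) -> None:
    # In Lightning this would set up the process group and run the training loop;
    # here it just identifies itself.
    print(f"worker {rank} of {world_size} started")


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_train_worker, args=(world_size,), nprocs=world_size)
```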
For single node training, ddp_spawn should serve you well. Did you see any limitations that pushed you to move away from it?
Ahh, I think this might be me more than the third-party framework, then: I'm using a Makefile to invoke the script from several directories up. When I run from the correct directory, it appears to get past the initialization phase (thanks for your help figuring that out!).
Unfortunately, I think I've found a reason I should be using 'ddp_spawn' for my use case: 'ddp' appears to start the entire script over again for each process, while 'ddp_spawn' just kicks off the multiprocessing at the point training begins. There's a lot of data loading and preparation at the beginning of the script, which I'd have to wait for 4 processes to get through before training could begin with 'ddp'. With 'ddp_spawn', the spawned processes exist only from the trainer.fit() call through the end of the script.
As for the issues I've had with 'ddp_spawn', see below...
I'm running a script that iteratively runs my training loop. With ddp_spawn, I can get through one training run just fine (and with all of the apparent benefits of multi-GPU training). However, when the next iteration comes along, I'm met with an error in spawn.py: one of the spawned subprocesses gets hung up on an import statement in my script with the loop. Obviously, the module exists, or the main process wouldn't be able to run (and spawn the subprocesses) in the first place.
So, running the same code, the next iteration fails to kick off training correctly while those subprocesses are being instantiated, and I'm again left with a hanging process I have to kill. Effectively, this defeats the purpose of my loop, since only the first iteration works correctly, every time.
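For context, the driver loop is structured roughly like this (hypothetical names; the real code builds Luigi TrainLSTM tasks that each construct their own Trainer):

```python
import pytorch_lightning as pl


def run_experiments(configs, build_model, build_datamodule):
    for config in configs:
        trainer = pl.Trainer(
            accelerator="gpu",
            devices=4,
            strategy="ddp_spawn",
            max_epochs=config["max_epochs"],
        )
        # The first iteration trains fine; from the second iteration on, the
        # spawned workers fail while importing modules used by this script.
        trainer.fit(build_model(config), datamodule=build_datamodule(config))
```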
The other thing I notice with ddp_spawn is that, after training completes, it appears to leave a zombie process alive for the rest of my program's execution. I've added code to manually kill() and wait() on all child processes after training, which seems to get rid of the zombie, but it doesn't fix the issue of the subsequent loop iteration failing to import my modules correctly.
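Roughly, the cleanup I added looks like this (a sketch; psutil is an assumption and the real code may differ):

```python
import os

import psutil  # assumption: used here for convenience; os.kill()/os.wait() also work


def reap_children(timeout: float = 5.0) -> None:
    """Force-terminate and reap any child processes left over after trainer.fit()."""
    children = psutil.Process(os.getpid()).children(recursive=True)
    for child in children:
        child.kill()                                   # terminate leftover workers
    psutil.wait_procs(children, timeout=timeout)       # wait() on them so no zombies remain
```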
Hey @esaracin, is this still a problem, or did you make further progress debugging the issues?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Bug description
I'm unable to use the suggested strategy of 'ddp' when training my model across multiple (4) GPUs on a single node. Using the default strategy of 'ddp_spawn' leads to bugs and doesn't seem very stable (which is what the documentation seems to suggest as well).
Specifically, when I use 'ddp' (or some variation of it) as my strategy, training always hangs on the .fit() call while the processes are being created. The first of the 4 distributed processes appears to be created, and the other 3 error out with a "No such file or directory" error pointing at the main script being executed. It's worth noting that the program doesn't crash; it just hangs (in fact I need to close the terminal or pkill it to end it), which suggests some sort of deadlock.
This happens on both 1.9.5 and now on 2.0.2.
What version are you seeing the problem on?
v2_0
How to reproduce the bug
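A minimal sketch of the setup that triggers the hang for me (illustrative stand-in; the real script is a Luigi pipeline training LSTMs and logging to MLflow):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp",   # hangs in .fit(): rank 0 starts, the other ranks report
        max_epochs=1,     # "No such file or directory" for the main script
    )
    trainer.fit(ToyModel(), data)
```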
Error messages and logs
Per the screenshot in the description, each of the processes that fails to launch reports a "No such file or directory" error for the main script.
Environment
How you installed Lightning: pip
Running in an AWS EC2 environment.
More info
No response
cc @justusschock @awaelchli