Lightning-AI / pytorch-lightning


Unable to use 'ddp' strategy: "No such file or directory" on initializing distributed #17530

Closed · esaracin closed this issue 1 year ago

esaracin commented 1 year ago

Bug description

I'm unable to use the suggested strategy of 'ddp' when training my model across multiple (4) GPUs on a single node. Using the default strategy of ddp_spawn leads to bugs and doesn't seem very stable (which is what the documentation itself seems to suggest).

Specifically, when I use 'ddp' (or some variation of 'ddp') as my strategy, training always hangs on the .fit() call while the processes are being created. The first of the 4 distributed processes appears to be created, and the other 3 error out with a "No such file or directory" error pointing at the main script being executed. It's worth noting that the program doesn't crash; it just hangs (in fact I need to close the terminal or run pkill to end it), which suggests some sort of deadlock.

[Screenshot: the first process initializes, while the other 3 distributed processes fail with "No such file or directory" during initialization]

This happened both on 1.9.5 and now on 2.0.2.

What version are you seeing the problem on?

v2_0

How to reproduce the bug

from pytorch_lightning import Trainer  # this environment imports Lightning as pytorch_lightning (see below)

trainer = Trainer(
    deterministic=True,
    gradient_clip_val=0.5,
    precision=16,
    gradient_clip_algorithm="value",
    callbacks=callbacks,  # callbacks, logger, self.max_epochs, model and datamodule are defined elsewhere
    profiler="simple",
    max_epochs=self.max_epochs,
    logger=logger,
    devices=4,
    accelerator="gpu",
    strategy="ddp",
)
trainer.fit(model, datamodule=datamodule)

Simply removing the strategy argument allows training to proceed as expected (when printing the strategy after omitting 'ddp', I've confirmed it defaults to DDPSpawn).
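For reference, a minimal sketch of how one could confirm which strategy the Trainer resolves to when no strategy argument is passed (the exact class name depends on the Lightning version):

from pytorch_lightning import Trainer

# Instantiate a Trainer without an explicit strategy and inspect what it resolved to.
trainer = Trainer(devices=4, accelerator="gpu")
print(type(trainer.strategy).__name__)  # a spawn-based DDP strategy in the setup described above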

Error messages and logs

Per the above screenshot, I get:

[Errno 2] No such file or directory for the 3 other processes during their initialization.

Environment

How you installed Lightning: pip
Running in an AWS EC2 environment.

More info

No response

cc @justusschock @awaelchli

awaelchli commented 1 year ago

Which file contains your code that you are running? Is it hyperparameter_tune_lt_lstm.py? Can you check the permissions on this file? Make sure you don't have any code in that file that changes the current working directory, otherwise the Trainer can't access and launch that script for the other processes.

What happens if you run our debug script with strategy="ddp" and devices=2? https://github.com/Lightning-AI/lightning/blob/master/examples/pytorch/bug_report/bug_report_model.py
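A quick way to check both of these from inside the failing script could look like this (a hypothetical diagnostic snippet, not part of Lightning):

import os
import sys

# Print what the ddp launcher would try to re-execute for the other ranks,
# and whether that path actually exists and is readable.
script = os.path.abspath(sys.argv[0])
print("cwd:     ", os.getcwd())
print("script:  ", script)
print("exists:  ", os.path.exists(script))
print("readable:", os.access(script, os.R_OK))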

esaracin commented 1 year ago

Yes, hyperparameter_tune_lt_lstm.py is the main script that invokes the training. For some more insight, it's a script that leverages a Luigi pipeline to train LSTMs and log them to our MLFlow instance. Ultimately, it instantiates successive TrainLSTM tasks (the classes whose code creates the Trainer and runs trainer.fit()). I can post more of that code if you think it would be useful.

As an earlier test (suspecting the same thing), I chmod 777'd the script just to ensure it would be accessible to any other processes. I also attempted running it with sudo but was met with the same No such file or directory error. There's also nowhere in the code where I change directories. Do you happen to know of any specific differences between ddp_spawn and ddp that would allow the script to run as expected with the former method?

I just ran bug_report_model.py. For whatever reason, I had to change the from lightning.pytorch import LightningModule, Trainer statement to from pytorch_lightning import LightningModule, Trainer (since my system apparently doesn't know lightning by the former name). But with strategy='ddp' and devices=2 (and the same Python environment), it runs without issue.

esaracin commented 1 year ago

Missed the forest for the trees: the path the spawned processes are trying to open the script from is wrong. It has one subdirectory of the same name too many: the correct path to the script contains 2 nested subdirectories with the same name, while the path the spawned processes are looking in contains 3.

So the question becomes: where do the subprocesses get their information on where the script is located? And why do they get it correct for ddp_spawn and not ddp?

Thanks for your continued help!

awaelchli commented 1 year ago

It is very simple. The regular ddp launcher creates a new process using subprocess.Popen:

https://github.com/Lightning-AI/lightning/blob/f6af74bf158a064673b5db284490b0da1f6c6852/src/lightning/pytorch/strategies/launchers/subprocess_script.py#L130-L134

And the command it uses gets constructed here in this function:

https://github.com/Lightning-AI/lightning/blob/984f49f7195ddc67e961c7c498ee6e19fc0cecb5/src/lightning/fabric/strategies/launchers/subprocess_script.py#L136C1-L142

As you can see, the path is basically just os.path.abspath(sys.argv[0]). sys.argv[0] is just the script name as it was invoked, so this relies on the current working directory still being the same as it was at the start of the program.
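Roughly, the command construction amounts to something like this (a simplified sketch of the linked function, not the exact Lightning source):

import os
import sys

# The launcher re-runs the same interpreter on the same script path, plus the
# original command-line arguments, once per additional rank.
command = [sys.executable, os.path.abspath(sys.argv[0])] + sys.argv[1:]
# If the working directory has changed since startup, abspath() can resolve to a
# path that doesn't exist, which produces "[Errno 2] No such file or directory".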

Since you are using a third-party framework that creates these tasks, I suspect it is changing the working directory, so by the time the Trainer launches the other processes it can't find the file under the current directory.

"And why do they get it correct for ddp_spawn and not ddp?" That's also simple: ddp_spawn uses multiprocessing.spawn from torch/Python to create the new processes from the already-running parent rather than re-launching the script, so it makes sense that it doesn't fail this way.
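For contrast, a minimal sketch of spawn-style launching (illustrative only, not Lightning's actual launcher code):

import torch.multiprocessing as mp

def train_worker(rank: int, world_size: int) -> None:
    # A spawn launcher calls back into the already-imported training code,
    # so sys.argv and the script path are never re-resolved from disk.
    print(f"worker {rank}/{world_size} started")

if __name__ == "__main__":
    mp.spawn(train_worker, args=(2,), nprocs=2)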

For single node training, ddp_spawn should serve you well. Did you see any limitations that pushed you to move away from it?

esaracin commented 1 year ago

Ahh, I think this might be me more than the third-party framework, then: I'm using a Makefile to invoke the script from several directories up. When I run it from the correct directory, it appears to get past the initialization phase (thanks for your help figuring that out!).

Unfortunately, I think I've found a reason I should be using 'ddp_spawn' for my use case: 'ddp' appears to start the entire script over again, while 'ddp_spawn' only kicks off the multiprocessing from the start of training through the end of the script. There's a lot of data loading and preparation at the beginning of the script, which I then have to wait for 4 processes to get through before training can begin with 'ddp'. With 'ddp_spawn', the extra processes exist only from the trainer.fit() call through the end of the script.
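A toy illustration of the relaunch behaviour described above (hypothetical file contents; with strategy='ddp' and devices=4, every rank re-executes the file from the top):

import os
import time

# toy_relaunch_demo.py (hypothetical): under strategy="ddp" each of the 4
# processes re-runs this whole file, so the expensive setup below runs 4 times;
# under strategy="ddp_spawn" only the parent process executes it.
rank = os.environ.get("LOCAL_RANK", "0")
print(f"[rank {rank}] loading and preparing data...")  # printed once per process under 'ddp'
time.sleep(5)  # stand-in for the heavy data loading / preparation

# ... build the model, the DataModule and Trainer(strategy="ddp"), then call fit()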

As for the issues I've had with 'ddp_spawn', see below...


I'm running a script that iteratively runs my training loop. With ddp_spawn, I can get through one training run just fine (and with all of the apparent benefits of multi-GPU training). However, when the next iteration comes along, I'm met with this error in spawn.py:

[Screenshot: traceback from spawn.py showing a spawned subprocess failing on an import from the script]

One of the spawned subprocesses gets hung up on an import statement in my script with the loop. Obviously, the module exists, or the main process wouldn't be able to run (and spawn the subprocesses).

So, running the same code on the next iteration, training fails to kick off correctly during the instantiation of those subprocesses, and I'm again left with a hanging process I have to kill. Effectively, this defeats the purpose of my loop, since only the first iteration ever works correctly.

The other thing I notice with ddp_spawn is that, after training completes, it appears to leave a zombie process alive for the rest of my program's execution. I've added code to manually kill() and wait() on all child processes after training, which seems to get rid of the zombie, but it doesn't fix the issue of the subsequent loop iteration failing to import my modules correctly:

[Screenshot: the same import failure recurring on the next iteration, even after the manual child-process cleanup]
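The cleanup described above might look roughly like this (a sketch assuming psutil is available; not the exact code from the thread):

import psutil

def reap_children() -> None:
    # Kill and reap any child processes left over after trainer.fit(),
    # e.g. a lingering ddp_spawn worker that has turned into a zombie.
    for child in psutil.Process().children(recursive=True):
        child.kill()
        child.wait()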
awaelchli commented 1 year ago

Hey @esaracin, is this still a problem, or did you make further progress debugging the issues?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!