Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Data loading hangs before first validation step #4450

Closed: jonashaag closed this issue 3 years ago

jonashaag commented 4 years ago

🐛 Bug

After the first training epoch, before the first validation step, training gets stuck somewhere in the data loaders (I think).

I can't provide a reproduction script, unfortunately: getting the training into this specific situation takes a long time (it must train long enough for the situation to arise).

I train on 4x 1080 Ti using DDP and num_workers=20. After the first training epoch, before the first validation, training gets stuck. All GPUs are reported to have 100% compute and memory utilization, but only 50/250 W power consumption. Only the 4 main Python threads seem to be doing any work (busy looping?). The 20 worker processes seem to have been stopped already.

To me it looks like the main threads are still busy waiting for new samples, while the dataloader workers are already gone.

Note that I use limit_train_batches=0.1, maybe this is the cause?

Unfortunately I don't have ptrace capability on the machine, so I can't use GDB etc. I printed the stack traces of all Python threads every 10s using a debugging thread. Logs of the hang situation are here: https://gist.github.com/jonashaag/b74ae9fc9267bde2cecd35ae316232c0
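
For reference, a minimal sketch of the kind of debugging thread I mean (not my exact code; sys._current_frames gives each thread's current frame):

    import sys
    import threading
    import time
    import traceback

    def start_stack_dumper(interval=10):
        # Periodically print the stack of every live Python thread,
        # so a hang can be located without ptrace/GDB access.
        def _dump():
            while True:
                time.sleep(interval)
                for thread_id, frame in sys._current_frames().items():
                    print(f"--- thread {thread_id} ---")
                    traceback.print_stack(frame)
        threading.Thread(target=_dump, daemon=True).start()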

I am currently training without limit_train_batches to see if it's due to that setting. EDIT: No, I can also reproduce without limit_train_batches set.

Environment

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     True
        - pyTorch_version:   1.8.0.dev20201028
        - pytorch-lightning: 0.10.0
        - tqdm:              4.51.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.8
        - version:           #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020
s-rog commented 4 years ago

Try using fast_dev_run to see if your validation loop works.
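
For example, a minimal sketch (model and datamodule are placeholders for your own objects):

    import pytorch_lightning as pl

    # fast_dev_run pushes a single batch through the training and validation
    # loops, which quickly shows whether the validation loop itself is broken.
    trainer = pl.Trainer(fast_dev_run=True)
    trainer.fit(model, datamodule=datamodule)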

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

jamesjjcondon commented 3 years ago

Hey @jonashaag , I think I'm having a similar problem. Did you find a work-around? Thanks.

jonashaag commented 3 years ago

Unfortunately I haven't. But from many other instances of multiprocessing problems, I think the PyTorch/Python scientific stack has serious problems with multiprocessing. Whenever possible I'd recommend using as little multiprocessing as possible, maybe putting work into entirely different workers using a job queue, etc. I guess there is just a TON of potential race conditions with CUDA, PyTorch, OpenMP, etc. all being used at the same time. I also always set OMP_NUM_THREADS=1 in multiprocessing environments, because otherwise things will be very slow and potentially deadlock; see here for one of the many cases where I pinned down one deadlock a bit further: https://github.com/numpy/numpy/issues/17752
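
A minimal sketch of the OMP_NUM_THREADS trick, assuming you control the entry point of the training script (setting the environment variable on the command line works just as well):

    import os

    # Limit OpenMP to one thread per process *before* numpy/torch are imported,
    # so forked dataloader workers don't contend on OpenMP thread pools.
    os.environ["OMP_NUM_THREADS"] = "1"

    import numpy as np  # noqa: E402
    import torch  # noqa: E402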

jamesjjcondon commented 3 years ago

Not an issue if I switch to 'ddp_spawn' with dataloader num_workers=0. Will try to get a reproducible example (but I desperately need my only machine with more than 1 GPU for training, for a deadline). This is probably a PyTorch issue, AFAIK; some suggestions say it's rooted in synced batch norm - see 20611, 22671, 104831 (apparently not resolved with PyTorch 1.7).

hkmztrk commented 3 years ago

Have you found a solution @jamesjjcondon?

jamesjjcondon commented 3 years ago

Unfortunately not, @hkmztrk. I just have to restart periodically. I think this is pretty low-level, possibly even Python multiprocessing related. Haven't tried with Python 3.8 yet, but I'm not convinced it'll make a difference.

alvis commented 3 years ago

Interestingly enough, I only have a single-GPU machine and I'm experiencing the same issue. Is this an issue only on multi-GPU setups, or on any setup?

NadavLightricks commented 3 years ago

I also got the same behavior: PL is stuck in an infinite loop trying to get a batch, but it never reaches the dataset, and I really don't know what to do. This is on CPU for me.

It seems that the code is stuck in an infinite loop in the PyTorch dataloader.py file at line 1147:

    # from torch/utils/data/dataloader.py, inside _MultiProcessingDataLoaderIter._get_data
    while True:
        success, data = self._try_get_data()
        if success:
            return data

It was fixed when I set num_workers=0, but it's still weird because I used more workers before without issue.

jamesjjcondon commented 3 years ago

Is everyone using large images / small batch sizes, by chance?

I think we're going to need to make a reproducible example.

wtseng530 commented 3 years ago

Hello @jamesjjcondon, I'm observing the same issue. My code works fine on CPU and on a single GPU, but hangs before the first validation step in a multi-GPU scenario. I tried your workaround of setting ddp_spawn and num_workers=0, which runs (thanks!).

In my case, my image size is just (3, 32, 32), with a small batch size.

jamesjjcondon commented 3 years ago

Great. I'm actually still getting this even with ddp_spawn and num_workers=0 (FYI for everyone else).

aleSuglia commented 3 years ago

Same issue here!

sounakdey commented 2 years ago

The validation steps seem to take much longer than training... did anyone find a solution to this issue?

UmaisZahid commented 2 years ago

For what it's worth, I believe this issue still exists (validation step hanging when using num_workers > 1).

Attaching a debugger to the hung process reveals that the data queue's get() is timing out and raising the Empty exception, even though all the workers are seemingly still alive. I haven't dug further, but it's a little problematic that you're forced to use num_workers = 1 in this case!

aasharma90 commented 2 years ago

Same issue here, even with num_workers=0 and ddp_spawn settings. Anyone got a fix?

espoirMur commented 2 years ago

Same issue here. I think in my case the process just stops. I'm using a batch size of 1 and only 4 instances.

kacwin commented 2 years ago

Same issue. Validation steps are taking way too long.

UPDATE: The freeze/idle time of the process happens when the model tries to log something during the validation step. If I remove logging from the validation step, there is no problem at all.
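
For illustration, a minimal sketch of what this workaround looks like (hypothetical module; the point is simply that there is no self.log call inside validation_step):

    import pytorch_lightning as pl
    import torch
    import torch.nn.functional as F

    class MyModel(pl.LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 10)

        def forward(self, x):
            return self.layer(x)

        def validation_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # No self.log(...) here: logging inside validation_step is what
            # triggered the hang, so it is left out entirely.
            return loss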

lenshin commented 2 years ago

> Same issue. Validation steps are taking way too long.
>
> UPDATE: The freeze/idle time of the process happens when the model tries to log something during the validation step. If I remove logging from the validation step, there is no problem at all.

I have the same problem, please give me some advice. How do I remove logging in the validation step?

GongXinyuu commented 2 years ago

+1

KuSi833 commented 2 years ago

I've had the same issue and managed to fix it by setting the dataloader's persistent_workers parameter to True. Without it, workers get killed at the end of each epoch and then recreated at the start of the next, which was so slow for me that training with 0 workers was faster. Not only did this occur at the start of each epoch, but also when switching between training, validation and testing. With this option, however, it's definitely worth it to have num_workers > 0 and pin_memory = True, as there is no more delay.
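
Something like this (a sketch with a placeholder dataset and batch size; note that persistent_workers requires num_workers > 0):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        train_dataset,            # placeholder for your dataset
        batch_size=32,
        shuffle=True,
        num_workers=8,
        pin_memory=True,
        persistent_workers=True,  # keep workers alive across epochs and loops
    )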

SagiPolaczek commented 2 years ago

Same issue here: after the training phase, just before validation, one of the processes (the secondary one) hangs, while the main process continues to run until it hangs too (after the validation phase). Note that in my case this issue occurs while using a custom batch_sampler.

To reproduce: see the pull request I was working on. First, check out this commit (later commits might not be relevant). Then just create a new Python env, install the FuseMedML library, and run:

python examples/fuse_examples/imaging/classification/mnist/run_mnist_ddp.py

I'm running on:

Python 3.8.13
pytorch-lightning==1.7.7
torch==1.12.1

Note that the problem also occurs on Python 3.7 and 3.10.

levhaikin commented 1 year ago

I had a similar issue whenever using num_workers > 0: it was either hanging or getting segmentation faults, with a message about a detected deadlock: pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 2. I also noticed the following warning (one line printed per worker): OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

This led me to isolate a librosa invocation that was causing all these problems: librosa.resample (commenting it out removed all issues).

I read about resample a bit and realized there are multiple resample methods (res_type). The default is soxr_hq, as described in: https://github.com/librosa/librosa/blob/8c4c1958888bb1c9a81af317106864af50dcf654/librosa/core/audio.py#L543

I switched to res_type='scipy'. This cleared all errors and now everything runs nicely. I raised num_workers from 0 to 8 for both validation and training. The validation cycle went from 5:30 minutes to 1:10, and 200 training steps went from 2:30 minutes to 1:00. I didn't observe any quality degradation (so far). It was definitely worth debugging this.
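
For reference, roughly what the change looked like (a sketch; y and the sample rates are placeholders, and it assumes librosa's keyword orig_sr/target_sr API):

    import librosa

    # 'scipy' (an alias for the FFT-based scipy.signal.resample path) avoided
    # the multiprocessing hangs I was seeing with the default 'soxr_hq'.
    y_resampled = librosa.resample(y, orig_sr=48000, target_sr=16000, res_type="scipy")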

I hope this information helps, although it's not a general solution for librosa/numpy multi-threading/processing issues.

paapu88 commented 1 year ago

I had this problem on a Tesla A100 with

trainer = pl.Trainer(
    devices=2,
    num_nodes=4,
    accelerator="gpu",
    strategy="ddp_find_unused_parameters_false",
    val_check_interval=conf.val_check_interval if num_nodes <= 1 else None,
)

It froze with something like "Validation DataLoader 0:". My guess for the cause was that the total num_workers became too big when the train and val dataloaders were active at the same time. Solution: I kept the train dataloader as it was:

train_dataloader = DataLoader(
    dataset=train_set,
    batch_size=conf.batch,
    shuffle=True,
    num_workers=40,
    pin_memory=ngpus > 0,
)

But I manually reduced num_workers in the val dataloader from 40 to 4:

val_dataloader = DataLoader(
    dataset=val_set,
    batch_size=conf.batch,
    shuffle=False,
    num_workers=4,
    pin_memory=ngpus > 0,
)

Somewhere in the PyTorch Lightning docs it says that if the number of workers is too big, CPU memory gets filled and everything crashes...

LucaMarconato commented 1 year ago

My dataloaders were stuck when setting num_workers > 0 on a particular machine (both on CPU and GPU). The solution was to change the multiprocessing start method.

# solution from ChatGPT
import torch.multiprocessing as mp

mp.set_start_method('spawn', force=True)  # You can also try 'fork' or 'forkserver'

tonydavis629 commented 1 year ago

I am having this issue as well, only with multi-GPU training. The training loop runs normally; validation freezes before the first step. I have tried every combination of persistent_workers, pin_memory, and num_workers. The spawn multiprocessing method does not solve it for me.

mjkvaak commented 1 year ago

Besides being able to "solve" the freezing problem by setting num_workers=0 in the validation dataloader, I was also able to fix this on PL version 2.0.9 by adding sync_dist=True to all the manual self.log calls I had added in my pl.LightningModule model steps, i.e.

def training_step(self, batch, batch_idx):
    ...
    self.log("train.loss", loss, sync_dist=True)

def validation_step(self, batch, batch_idx):
    ...
    self.log("val.loss", loss, sync_dist=True)

# + same for test_step, which I didn't explicitly have

My Trainer strategy was set to ddp. I didn't test it myself, but ChatGPT said sync_dist=True doesn't hurt in the non-DDP setting either. If you want to play it safe, you could probably do something like sync_dist=self.trainer.world_size > 1.

EDIT: same finding seems to have been raised in https://github.com/Lightning-AI/lightning/issues/8821#issuecomment-902402784

daniel347x commented 4 months ago

@levhaikin Thank you for taking the time to debug and report the issue with librosa.resample hanging when num_workers > 0 in a training loop. I was hit by exactly the same issue and had already identified librosa.resample as the cause. I personally resolved it by caching the result of librosa.resample so that it only runs on the first epoch, which never exhibited the problem. I may try your workaround with res_type="scipy".
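
For what it's worth, a minimal sketch of that caching idea (hypothetical dataset class; it assumes fixed sample rates so the file path can serve as the cache key):

    import librosa
    from torch.utils.data import Dataset

    class CachedResampleDataset(Dataset):  # hypothetical, for illustration only
        def __init__(self, paths, orig_sr=48000, target_sr=16000):
            self.paths = paths
            self.orig_sr = orig_sr
            self.target_sr = target_sr
            self._cache = {}  # path -> resampled waveform

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            path = self.paths[idx]
            if path not in self._cache:
                y, _ = librosa.load(path, sr=self.orig_sr)
                # Resample only once per file; later epochs hit the cache.
                # Note: with num_workers > 0 each worker keeps its own cache,
                # so persistent_workers=True (or a disk cache) is what makes
                # this pay off across epochs.
                self._cache[path] = librosa.resample(
                    y, orig_sr=self.orig_sr, target_sr=self.target_sr
                )
            return self._cache[path]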

Thanks again.

J-zin commented 3 months ago

Why is this closed? There's still no official solution, even now in 2024? @williamFalcon @Borda

Andrey-rmnv commented 2 months ago

> Same issue. Validation steps are taking way too long.
>
> UPDATE: The freeze/idle time of the process happens when the model tries to log something during the validation step. If I remove logging from the validation step, there is no problem at all.

I'm having the same problem, and I noticed this too. I'm not sure, but it seems like it happens when the model is trying to save a checkpoint. Is there a solution that doesn't require removing logging?