Try using fast_dev_run to see if your validation loop works.
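In case it helps, a minimal sketch of that suggestion (the model and datamodule names are placeholders, not from this thread):

import pytorch_lightning as pl

# fast_dev_run=True runs a single train and validation batch and then exits,
# which is a quick way to check that the validation loop itself is not broken.
trainer = pl.Trainer(fast_dev_run=True)
trainer.fit(MyLightningModule(), datamodule=my_datamodule)  # placeholder objects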
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Hey @jonashaag, I think I'm having a similar problem. Did you find a workaround? Thanks.
Unfortunately I haven't. But from many other instances of multiprocessing problems, I think the PyTorch/Python scientific stack has serious problems with multiprocessing. Whenever possible I'd recommend using as little multiprocessing as possible, maybe putting work into entirely different workers using a job queue, etc. I guess there is just a TON of potential race conditions with CUDA, PyTorch, OpenMP, etc. all being used at the same time. I also always use OMP_NUM_THREADS=1 in multiprocessing environments, because otherwise things will be very slow and can potentially deadlock; see here for one of the many cases where I pinned down one of those deadlocks a bit further: https://github.com/numpy/numpy/issues/17752
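For illustration, one way to apply that setting from Python (a sketch; the variable has to be set before the heavy libraries are imported, otherwise export it in the shell instead):

import os

# Limit OpenMP to one thread per process to avoid oversubscription and the
# deadlocks described above. Must happen before numpy/torch are imported.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after the env var on purpose)
import torch        # noqa: E402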
Not an issue if I switch to 'ddp_spawn' with num_workers=0 in the dataloaders. Will try to get a reproducible example (but I desperately need my only machine with more than 1 GPU for training for a deadline). This is probably a PyTorch issue, AFAIK; some suggestions are that it's rooted in synced batch norm - see PyTorch issues 20611, 22671 and 104831 (apparently not resolved with PyTorch 1.7).
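A sketch of that workaround, assuming "dataloaders = 0" means num_workers=0 (the exact Trainer argument for ddp_spawn differs between Lightning versions; dataset and model variables are placeholders):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

# No dataloader worker processes at all, plus spawn-based DDP.
train_loader = DataLoader(train_set, batch_size=32, num_workers=0)
val_loader = DataLoader(val_set, batch_size=32, num_workers=0)

# Recent Lightning versions take strategy="ddp_spawn"; older 1.x releases used
# accelerator="ddp_spawn" or distributed_backend="ddp_spawn".
trainer = pl.Trainer(devices=2, accelerator="gpu", strategy="ddp_spawn")
trainer.fit(model, train_loader, val_loader)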
Have you found a solution @jamesjjcondon?
Unfortunately not, @hkmztrk. I just have to restart periodically. I think this is pretty low-level, possibly even Python-multiprocessing related. I haven't tried with Python 3.8 yet, but I'm not convinced it'll make a difference.
Interestingly enough, I only have a single-GPU machine and I'm experiencing the same issue. Is this an issue only on multi-GPU setups, or on any setup?
I also got the same behavior: PL is stuck in an infinite loop trying to get a batch, but it never reaches the dataset, and I really don't know what to do. This is on CPU for me.
It seems that the code is stuck in an infinite loop in PyTorch's dataloader.py, at line 1147:
while True:
    success, data = self._try_get_data()
    if success:
        return data
It was fixed when I set num_workers=0, but it's still weird because I used more workers before without issue.
Is everyone using large images / small batch sizes, by chance? I think we're going to need to make a reproducible example.
Hello @jamesjjcondon, I'm observing the same issue. My code works fine on CPU and on a single GPU, but hangs before the first validation step in a multi-GPU scenario. I tried your workaround of setting ddp_spawn and num_workers=0, which runs! (Thanks.)
In my case, my image size is just (3, 32, 32) with a small batch size.
Great. I'm actually still getting this even with ddp_spawn and num_workers=0 (FYI everyone else)
Same issue here!
The validation steps seem to take much longer than training... did anyone find a solution to this issue?
For what it's worth, I believe this issue still exists (validation step hanging when using num_workers > 1).
Attaching a debugger to the hung process reveals that the data queue's get() is timing out and throwing the Empty exception, even though all the workers are seemingly still alive. I haven't dug further, but it's a little problematic that you're forced to use num_workers = 1 in this case!
Same issue here, even with the num_workers=0 and ddp_spawn settings. Anyone got a fix?
Same issue here. I think in my case the process just stops. I'm using a batch size of 1 and only 4 instances.
Same issue. Validation steps are taking way too long.
UPDATE: The freeze/idle time of the process happens when the model is trying to log something during the validation step. If I remove logging in the validation step, there is no problem at all.
I have the same problem, please give me advice. How do I remove logging in the validation step?
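For anyone wondering what that looks like in code, a hedged sketch of a validation_step with the log call taken out (loss_fn is a placeholder; the sync_dist=True alternative is discussed further down in this thread):

def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = self.loss_fn(self(x), y)  # placeholder loss computation
    # self.log("val_loss", loss)     # removed: this call was triggering the hang
    return loss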
+1
I've had the same issue and managed to fix it by setting the dataloader persistent_workers parameter to True. Without it, workers get killed at the end of each epoch and then recreated at the start of the next, which was so slow for me that training with 0 workers was faster. Not only did this occur at the start of each epoch, but also when switching between training, validation and testing. With this option, however, it's definitely worth it to have num_workers > 0 and pin_memory = True, as there is no more delay.
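A sketch of the dataloader settings described above (dataset and batch size are placeholders; persistent_workers requires PyTorch >= 1.7):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,                 # placeholder dataset
    batch_size=32,
    num_workers=8,
    persistent_workers=True,   # keep workers alive across epochs and train/val switches
    pin_memory=True,
)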
Same issue here: after the training phase, just before validation, one of the processes (the secondary one) hangs, while the main process continues to run until it hangs too (after the validation phase). Note that in my case this issue occurs while using a custom batch_sampler.
TO REPRODUCE: see the pull request I was working on. First you should check out this commit; later commits might not be relevant. Now just create a new Python env, install the FuseMedML library, and run:
python examples/fuse_examples/imaging/classification/mnist/run_mnist_ddp.py
I'm running on:
Python 3.8.13
pytorch-lightning==1.7.7
torch==1.12.1
Note that the problem also occurs on Python 3.7 and 3.10.
I had a similar issue: whenever using num_workers > 0, it either hung or got segmentation faults, with a message about a detected deadlock:
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 2
I also noticed the following warning (one line printed per worker):
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
This led me to isolate a librosa invocation that was causing all these problems: librosa.resample (commenting it out removed all the issues).
I read about resample a bit and realized there are multiple resample methods (res_type). The default is soxr_hq, as described in: https://github.com/librosa/librosa/blob/8c4c1958888bb1c9a81af317106864af50dcf654/librosa/core/audio.py#L543
I switched to res_type='scipy'. This cleared all the errors and now everything runs nicely.
I raised num_workers from 0 to 8 for both validation and training: the validation cycle went from 5:30 minutes down to 1:10, and 200 training steps went from 2:30 minutes down to 1:00 minute. I didn't observe any quality degradation (so far), so it was definitely worth debugging this.
I hope this information helps, although it's not a general solution for librosa/numpy multi-threading/multi-processing issues.
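A hedged sketch of the librosa change described in this comment (file path and target rate are placeholders; res_type is a documented librosa.resample parameter):

import librosa

y, sr = librosa.load("example.wav", sr=None)  # placeholder audio file

# The default res_type is soxr_hq (see the linked source); switching to scipy's
# FFT-based resampler avoided the hangs/segfaults with num_workers > 0.
y_resampled = librosa.resample(y, orig_sr=sr, target_sr=16000, res_type="scipy")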
I had a problem with a Tesla A100 with:
trainer = pl.Trainer(
    devices=2,
    num_nodes=4,
    accelerator="gpu",
    strategy="ddp_find_unused_parameters_false",
    val_check_interval=conf.val_check_interval if num_nodes <= 1 else None,
)
It froze with something like
Validation Dataloader 0:
My guess for the cause was that the total num_workers became too big when train and val run at the same time.
Solution:
I kept the original train dataloader:
train_dataloader = DataLoader(
    dataset=train_set,
    batch_size=conf.batch,
    shuffle=True,
    num_workers=40,
    pin_memory=ngpus > 0,
)
But I manually reduced num_workers in the val dataloader from 40 to 4:
val_dataloader = DataLoader(
    dataset=val_set,
    batch_size=conf.batch,
    shuffle=False,
    num_workers=4,
    pin_memory=ngpus > 0,
)
Somewhere in the PyTorch Lightning manual it says that if the number of workers is too big, CPU memory gets filled and everything crashes...
My dataloaders were stuck when setting num_workers > 0 on a particular machine (both on CPU and GPU). The solution was to change the multiprocessing start method.
# solution from ChatGPT
import torch.multiprocessing as mp
mp.set_start_method('spawn', force=True) # You can also try 'fork' or 'forkserver'
I am having this issue as well, only on multi-GPU training. The training loop runs normally; validation freezes before the first step. I have tried every combination of persistent workers, pinned memory, and num_workers. Spawn multiprocessing does not solve it for me.
Besides being able to "solve" the freezing problem by setting num_workers=0 in the validation dataloader, I was also able to fix this on PL version 2.0.9 by adding sync_dist=True to all the manual self.log calls I had added in my pl.LightningModule model steps, i.e.
def training_step(self, batch, batch_idx):
    ...
    self.log("train.loss", loss, sync_dist=True)

def validation_step(self, batch, batch_idx):
    ...
    self.log("val.loss", loss, sync_dist=True)

# + same for test_step, which I didn't explicitly have
My Trainer strategy was set to ddp. I didn't test it myself, but ChatGPT said sync_dist=True doesn't hurt in the non-DDP setting either. If you want to play it safe, you could probably do something like sync_dist=self.trainer.world_size > 1.
EDIT: same finding seems to have been raised in https://github.com/Lightning-AI/lightning/issues/8821#issuecomment-902402784
@levhaikin Thank you for taking the time to debug and report the issue with librosa.resample hanging when num_workers > 0 in a training loop. I have run into exactly the same issue and had already identified librosa.resample as the cause. I personally resolved it by caching the result of librosa.resample so that the function only runs on the first epoch, which never exhibited the problem. I may try your workaround with res_type="scipy".
Thanks again.
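A minimal sketch of that caching idea, assuming the resampling happens inside a Dataset's __getitem__ (the class and its fields are hypothetical). Note that with num_workers > 0 each worker process keeps its own in-memory cache, so precomputing to disk may be closer to what the commenter actually did:

import librosa
from torch.utils.data import Dataset

class CachedResampleDataset(Dataset):  # hypothetical dataset
    def __init__(self, files, target_sr=16000):
        self.files = files
        self.target_sr = target_sr
        self._cache = {}  # index -> resampled waveform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Only the first access per item pays the librosa.resample cost.
        if idx not in self._cache:
            y, sr = librosa.load(self.files[idx], sr=None)
            self._cache[idx] = librosa.resample(y, orig_sr=sr, target_sr=self.target_sr)
        return self._cache[idx]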
Why is this closed???? There is no official solution even now, in 2024???? @williamFalcon @Borda
Having the same problem, and I noticed the logging freeze too. I'm not sure, but it also seems to happen when the model is trying to save a checkpoint. Is there a solution that doesn't require removing logging?
🐛 Bug
After the training epoch, before the first validation step, training gets stuck somewhere in the data loaders (I think).
I can't provide a reproduction script, unfortunately: getting the training into the specific situation takes a long time (it must train for long enough for the situation to arise).
I train on 4x 1080 Ti using DDP and num_workers=20. After the first training epoch, before the first validation, training gets stuck. All GPUs are reported to have 100% compute and memory utilization, but only 50/250 W power consumption. Only the 4 main Python threads seem to be doing any work (busy looping?). The 20 worker processes seem to have been stopped already.
To me it looks like the main threads are still busy waiting for new samples, while the dataloaders have already gone.
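For concreteness, a hedged sketch of the setup described above (Lightning 1.x-era arguments; model, datasets and batch size are placeholders):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=20)
val_loader = DataLoader(val_set, batch_size=batch_size, num_workers=20)

# 4x 1080 Ti with DDP, as described in the report.
trainer = pl.Trainer(gpus=4, accelerator="ddp")
trainer.fit(model, train_loader, val_loader)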
Note that I use limit_train_batches=0.1, maybe this is the cause?
Unfortunately I don't have ptrace capability on the machine, so I can't use GDB etc. I printed the stack traces of all Python threads every 10 s using a debugging thread. Logs of the hang situation are here: https://gist.github.com/jonashaag/b74ae9fc9267bde2cecd35ae316232c0
I am currently training without limit_train_batches to see if it's due to that setting. EDIT: No, I can also reproduce without limit_train_batches set.
Environment