Closed TheMrZZ closed 1 year ago
Perhaps this issue persists? I am still experiencing a similar freeze during validation on version 2.2.4.
This issue still exists in version 2.3.3. With a higher num_workers, the time between epochs is significantly longer. I tested the influence of checkpoint saving and hyperparameter logging and found that these settings do not affect the runtime. The bug is the same as the initial finding and is caused by the dataloader.
An easy fix can be setting `num_workers=0` or adding `persistent_workers=True` when instantiating the dataloader.
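A minimal sketch of the two workarounds mentioned above, using a toy dataset (the dataset and variable names here are illustrative, not from the original report):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 64 samples of 3 features each, with binary labels.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# Workaround 1: no worker processes at all (no respawn cost between
# epochs, but data loading happens in the main process and can be slower).
loader_single = DataLoader(dataset, batch_size=8, num_workers=0)

# Workaround 2: keep worker processes alive across epochs instead of
# respawning them at every epoch boundary (requires num_workers > 0).
loader_persistent = DataLoader(
    dataset, batch_size=8, num_workers=2, persistent_workers=True
)

print(len(loader_single))  # 64 samples / batch_size 8 = 8 batches
```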
It is unfortunate that such an ugly bug, which had been fixed before, has resurfaced and seriously affects the training speed of the entire Lightning 2.x line.
I think this issue needs to be reopened and fixed as soon as possible. @awaelchli
`persistent_workers=True` did not actually work around the issue when I was testing.
Hi, version 2.4 still has this issue, and the recommended workarounds are not working.
Folks, I might have a solution.
TL;DR: set `OMP_NUM_THREADS=1 MKL_NUM_THREADS=1` in your environment.
The problem stems from a combination of odd torch defaults, using Slurm or a comparable scheduler without containerization, and a large num_workers count.
By default, torch uses as many threads as possible for interop and intraop operations. This "as many" is determined by the number of CPU cores in your system (see here).
If you are using a scheduler such as Slurm, torch will think that you have access to all the CPUs in your machine (since the node's resources are visible to the job) even if you have limited the number of cores allocated to the job. On a 100-core node, for example, torch will spawn hundreds of threads across the workers, suffocating your system.
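A quick way to see the mismatch described above is to compare the core count the process sees (which torch's defaults are derived from) with the cores the scheduler actually allocated. This is a sketch using only the standard library; `os.sched_getaffinity` is Linux-only, which matches the Slurm scenario:

```python
import os

# What torch's thread defaults are based on: every core on the node.
visible = os.cpu_count()

# What the scheduler actually gave this job (the CPU affinity mask).
allocated = len(os.sched_getaffinity(0))

# Under an uncontained Slurm job these two can differ wildly, e.g.
# visible=100 but allocated=8, so torch massively oversubscribes.
print(f"visible cores: {visible}, allocated cores: {allocated}")
```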
The solution is to reduce the number of threads that torch can spawn using the above-mentioned environment variables (1 is not required, I believe, but keeping it somewhat close to the actual number of CPUs would be smart). Alternatively, use containerization, people. Don't let Slurm pull you into its evil ways.
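The fix can be applied by exporting the variables before launching training. This is a sketch; the launch command is hypothetical and the value 1 is the conservative choice mentioned above (a value close to your actual per-job core allocation also works):

```shell
# Cap the thread pools that torch's OpenMP/MKL backends may spawn,
# so each dataloader worker does not try to use every core on the node.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Then launch training in the same shell, e.g. (hypothetical entry point):
# srun python train.py

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"
```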
In my experiments, this seems to resolve a couple of deadlocks I had been hitting, and considerably improves the behavior for this particular issue. There is still some delay when switching between train and validation workers, which might be a bug on Lightning's side (verification needed), but at least training is now manageable.
This might be the same issue as #4450, or pretty much most other non-reproducible performance issues in torch/lightning repos.
I tracked down my problem to `evaluation_loop.py` in PL. This line of code, `iter(data_fetcher)  # creates the iterator inside the fetcher`, takes too long to run. I guess `data_fetcher` is the culprit here.
in my case persistent_workers=True solved the issue
Sounds good, would you give it a try?
I converted some Pytorch code to Lightning. The dataset is loaded lazily by the train & eval dataloaders.
However, when moving the code to Lightning, I noticed a huge slowdown. After digging around, I found a ~10 second delay between each epoch. For comparison, in plain PyTorch an epoch takes ~4 s.
I first thought it was a data loading problem, but during the 10 s delay, no data is loaded (at least that's what my `print` statements tell me).

I think the issue is related to the number of workers, because setting `num_workers=0` solves the problem (but is slower overall, since a single worker is not enough). I know starting workers is slow; however, I have `persistent_workers=True`, and this does not happen in plain PyTorch. My data loaders also have `pin_memory=True` (removing `pin_memory` does not solve the problem).

Since this is company code, I cannot disclose the before/after, but I'll try to "anonymize" some code if necessary. Here is the lightning module:
Here is the result of `profiler="simple"`:

Here is the result of `profiler="advanced"`: https://pastebin.com/q3C5P826

Finally, here is a video demonstrating the problem. I'm printing each piece of data loading to prove it's not the issue: https://user-images.githubusercontent.com/30944236/140587623-ae184fa3-370a-42be-8593-200026d11ba4.mp4
Random information:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
cc @tchaton @rohitgr7 @borda @akihironitta