Spandan-Madan / Pytorch_fine_tuning_Tutorial

A short tutorial on performing fine tuning or transfer learning in PyTorch.

ImportError: DLL load failed: The paging file is too small for this operation to complete. #10

Open toiyeumayhoc opened 5 years ago

toiyeumayhoc commented 5 years ago

After running the main_fine_tuning.py file, I got this traceback:

Epoch 0/99
LR is set to 0.001
Traceback (most recent call last):
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "main_fine_tuning.py", line 265, in <module>
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    num_epochs=100)
  File "main_fine_tuning.py", line 162, in train_model
    for data in dset_loaders[phase]:
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 501, in __iter__
    return _DataLoaderIter(self)
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 289, in __init__
    w.start()
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    exitcode = _main(fd)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\dk12a7\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\dk12a7\Desktop\code classification\Pytorch_fine_tuning_Tutorial\main_fine_tuning.py", line 4, in <module>
    import torch
  File "C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\__init__.py", line 80, in <module>
    from torch._C import *
ImportError: DLL load failed: The paging file is too small for this operation to complete.
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\dk12a7\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

I tried setting BATCH_SIZE = 1, but the problem still occurs. Do you have any solution for this one?

brianFruit commented 5 years ago

I ran into the same problem. Have you found a solution?

toiyeumayhoc commented 5 years ago

@brianFruit Still stuck on this one.

MarcinMisiurewicz commented 5 years ago

I've also encountered that problem, and it seems to be a multiprocessing issue. What worked for me was reducing the number of workers in the DataLoader (line 108 in your code). Your number is quite high: 25. Workers are subprocesses that load the data, so if you have 25 of them your CPU can rebel :) Try reducing it to 1, and if that works you can gradually increase it. If I'm reasoning correctly, it shouldn't exceed the number of logical processors in your CPU (but if you are computing something else in parallel, like me right now with another DataLoader, you should decrease it even more). A minimal sketch of the change is below.
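
A minimal sketch of the suggested change, using a stand-in random dataset instead of the tutorial's ImageFolder (the dict name dset_loaders matches the traceback above; the batch size and dataset shape here are placeholders, not values from the tutorial):

# Hedged sketch: same dset_loaders shape as the tutorial's line 108, with
# num_workers reduced from 25 to 1; increase gradually if this runs cleanly.
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loaders(num_workers=1, batch_size=8):
    # Stand-in random dataset; the tutorial uses an ImageFolder per phase.
    dsets = {x: TensorDataset(torch.randn(64, 3, 224, 224),
                              torch.zeros(64, dtype=torch.long))
             for x in ['train', 'val']}
    return {x: DataLoader(dsets[x],
                          batch_size=batch_size,
                          shuffle=(x == 'train'),
                          num_workers=num_workers)  # was 25 in the tutorial
            for x in ['train', 'val']}

if __name__ == '__main__':  # required on Windows when num_workers > 0
    dset_loaders = build_loaders(num_workers=1)
    for inputs, labels in dset_loaders['train']:
        print(inputs.shape, labels.shape)
        break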

Hope that helps future generations.

Javierete commented 3 years ago

Hi there, I'm hitting the same problem with both of my setups (both on Windows). Originally I had an X99 board with an 8-core CPU, 64GB of RAM, and 2x RTX2080ti, and I was able to run up to 6x PyTorch RL algorithms with up to 10 multiprocessing workers each (60 workers total running in parallel - obviously they were taking turns). If I pushed past those numbers, I would get the errors described above. Now I have changed my setup to a 3970X with 32 cores, 64GB of RAM, and the same 2x GPUs, yet I can barely run 3x of the same algos with up to 8 workers each. Any load beyond that generates the same error, even though RAM usage never goes above 40-50% while they run. Any pointer in the right direction would be highly appreciated. Thanks!

Javierete commented 3 years ago

I think I managed to solve it (so far). Steps were:

1. Windows + Pause key
2. Advanced system settings
3. Advanced tab
4. Performance - Settings button
5. Advanced tab - Change button
6. Uncheck the "Automatically... BLA BLA" checkbox
7. Select the "System managed size" option
8. OK, OK, OK... Restart the PC. BOOM

Not sure if it's the best way to solve the problem, but it has worked so far (fingers crossed).

wood73 commented 3 years ago

@Javierete This solution is working for me - thanks! I noticed the error returned for me when free space dipped below 7-8 GB for the application I'm running.

Javierete commented 3 years ago

Hi Woodrow73, if it's of any value, I ended up setting the page file to manual with a ridiculous 360GB as the minimum and 512GB as the maximum. I also added an extra SSD and allocated all of it to virtual memory. This solved the problem, and now I can run up to 128 processes using PyTorch and CUDA. I did find out that every launch of Python and PyTorch loads a ridiculous amount of memory into RAM, which, when not used, often gets paged out to virtual memory. Anyway, just sharing my learnings.
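
To sanity-check how much of the page file is actually in use while the workers spawn, here is a small sketch (psutil is an assumed extra dependency, not something the tutorial uses; on Windows, psutil's swap figures report the page file):

# Hedged sketch: print physical RAM and page file (swap) usage.
# Requires: python -m pip install psutil
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"RAM:       {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB ({vm.percent}%)")
print(f"Page file: {sm.used / 2**30:.1f} / {sm.total / 2**30:.1f} GiB ({sm.percent}%)")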

rlhull6 commented 3 years ago

I ran this on my PC and encountered the issue, even though it seems like it should be minimal in terms of memory usage:

import tensorflow as tf
print(tf.__version__)

I just closed several applications and the problem went away, so it truly seems like a resource issue.

Chetanvikram46 commented 3 years ago

TF.txt

Can someone please assist me with this error? I am kinda new to this, so please help me out. I have attached the complete error message.

cobryan05 commented 2 years ago

I have managed to mitigate (although not completely solve) this issue. I posted a more detailed explanation at the StackOverflow link (see my comment further down in this thread), but basically try this:

Download: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5

Install dependency: python -m pip install pefile

Run (for the OP's paths) (NOTE: THIS WILL MODIFY YOUR DLLS [although it will back them up]): python fixNvPe.py --input C:\Users\dk12a7\Anaconda3\lib\site-packages\torch\lib\*.dll
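
If your Anaconda environment is installed somewhere other than the OP's path, here is a tiny sketch (assuming torch still imports in a plain interpreter outside the failing worker) to print the DLL glob to pass to --input:

# Hedged sketch: locate the torch DLL directory for fixNvPe.py's --input argument.
import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), 'lib')
print(os.path.join(lib_dir, '*.dll'))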

crazypythonista commented 2 years ago

6)- Uncheck the "Automatically... BLA BLA" checkbox

Hello, thanks for the solution, but it doesn't seem to work now. I have an HP Pavilion 15-EC2150AX laptop and the specified settings don't appear on my side. Any sort of help will be highly appreciated.

Thanks

cobryan05 commented 2 years ago

Hello, thanks for the solution, but it doesn't seem to work now. I have an HP Pavilion 15-EC2150AX laptop and the specified settings don't appear on my side. Any sort of help will be highly appreciated.

The setting name is "Automatically Manage Paging File Size For All Drives" and is at the top of the "Virtual Memory" page after clicking the 'change' button.

However, instead of making this change, you should first try my fix in the comment immediately before yours, and only apply paging file size fixes if they are still necessary.

For a description of what my fix does, see here: https://stackoverflow.com/a/69489193/213316
For a comparison of my fix against other fixes, see here: https://github.com/ultralytics/yolov3/issues/1643#issuecomment-985652432