benmoseley / FBPINNs

Solve forward and inverse problems related to partial differential equations using finite basis physics-informed neural networks (FBPINNs)
MIT License

Running Error #3

Closed. Gaurav11ME closed this issue 1 year ago.

Gaurav11ME commented 3 years ago

Hello,

I am getting an error while running the file paper_main_1D.py. I am using Spyder IDE on Anaconda.

"OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\gaura\Anaconda3\envs\torch\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies."

benmoseley commented 3 years ago

This looks related to this PyTorch issue: https://github.com/ultralytics/yolov3/issues/1643

A few thoughts come to mind:

1) It could be a RAM issue: the script currently runs 23 parallel processes to train the runs defined in the script. You could try using DEVICES = ["cpu"]*4 to use fewer processes (and less RAM).

2) It could be that the multiprocessing pool of workers defined in shared_modules/multiprocess.py is not "playing nicely" with Windows (all of my tests were carried out on Linux/macOS). It is worth testing without this class, i.e. training all of the runs in a large for loop on the main thread, to see whether this is the problem; see the sketch below.
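For reference, a minimal sketch of both suggestions, assuming paper_main_1D.py defines the DEVICES list and hands a set of run definitions to the pool in shared_modules/multiprocess.py; the runs and train_run names below are hypothetical placeholders, not the repository's actual API:

```python
# Sketch of the two workarounds suggested above, applied in paper_main_1D.py.

# 1) Use fewer worker processes (and therefore less RAM) by shrinking
#    the DEVICES list, e.g. to four CPU workers.
DEVICES = ["cpu"] * 4

# 2) Bypass the pool in shared_modules/multiprocess.py and train every run
#    sequentially on the main thread, to check whether multiprocessing on
#    Windows is the problem. `runs` and `train_run` are hypothetical
#    placeholders for however the script actually defines its runs and
#    launches a single training job; substitute the real names.
for run in runs:
    train_run(run)
```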

Gaurav11ME commented 3 years ago

Hello Ben,

I tried running it on Linux (Ubuntu) as well, on a computer with more RAM. That error is gone now, but the program does not exit; it remains stuck after 23 runs. Below is a screenshot of the output.

[screenshot: Error_Message]

benmoseley commented 3 years ago

That looks like normal behaviour. The script also writes a logging file per process in the current directory, named screenlog.main.[process id].log; if you look at these files you will see the training statistics output by each process as training progresses. I usually use the tailf Linux command to monitor these files during training. You can also use the top or htop Linux commands to check that your processes are indeed running, and nvidia-smi (or similar) if you are running on the GPU. The main program should stop once training is complete across all the processes. NB: each training run is placed in a queue, and the parallel processes consume it concurrently until the queue is empty.
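For anyone following along, tail -f screenlog.main.*.log serves the same purpose on systems where tailf is not available. The short Python sketch below does a one-off check of the same per-process logs; the filename pattern is taken from the comment above, everything else is illustrative and not part of the repository.

```python
# One-off check of the per-process training logs written by the script
# (screenlog.main.[process id].log in the current directory). Purely
# illustrative; the filename pattern comes from the comment above.
import glob

N_TAIL = 10  # number of trailing lines to print per log file

for path in sorted(glob.glob("screenlog.main.*.log")):
    with open(path) as f:
        tail = f.readlines()[-N_TAIL:]
    print(f"==> {path} <==")
    print("".join(tail), end="")
```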