gem / oq-engine

OpenQuake's Engine for Seismic Hazard and Risk Analysis
https://github.com/gem/oq-engine/#openquake-engine
GNU Affero General Public License v3.0

openquake Hazard run job.ini crash Memory Error!!! #9748

Closed NimaDolatabadi closed 1 week ago

NimaDolatabadi commented 1 month ago

Hi Dear Developers,

The link to my question is https://groups.google.com/g/openquake-users/c/iXlpSM6usFY/m/xk-vCOWSAQAJ

I am trying to run a hazard project with very high resolution using the following setting: region_grid_spacing = 5. However, I am encountering an error related to low memory, or the process gets stuck at a certain percentage while calculating.

I'm asking the developers here to assist me. As a Python programmer, I took a look inside the OpenQuake core and found something interesting. (I am using OpenQuake Engine version 3.19 with Anaconda).

In openquake/engine/engine.py, I can clearly see a memory check (memory here meaning RAM, not the CPU) in lines 239-295, specifically used_mem = psutil.virtual_memory().percent.
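For context, the kind of guard being described can be sketched roughly like this (a minimal sketch: the constant name and function structure are assumptions, only the psutil.virtual_memory().percent call is taken from engine.py):

```python
import psutil

# Mirrors hard_mem_limit in openquake.cfg (value from the report below).
HARD_MEM_LIMIT = 99  # percent of total system RAM

def check_memory():
    """Sample system-wide RAM usage and fail fast instead of swapping.

    Hypothetical helper illustrating the pattern; not the actual engine code.
    """
    used_mem = psutil.virtual_memory().percent  # % of total RAM in use
    if used_mem > HARD_MEM_LIMIT:
        raise MemoryError(
            f'Memory usage {used_mem}% exceeds the hard limit {HARD_MEM_LIMIT}%')
    return used_mem

print(check_memory())  # current system-wide RAM usage, in percent
```

The point of such a check is to stop a job early, before the operating system starts swapping and the machine becomes unresponsive.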

First, I have to say that when I run an OpenQuake hazard job, it never seems to use much memory, while the CPU is used to its maximum capacity. So why is there code that reads the memory usage? It seems useless.

Second, in openquake/engine/openquake.cfg, I found some very interesting components that control how the CPU is utilized. I am listing them here: hard_mem_limit = 99, pmap_max_mb = 50, pmap_max_gb = 4. I discovered that the higher the values for the last two parameters, the quicker the run crashes with a low memory error. Conversely, the lower the values, the more resources are used. When I set them to 1, there were no crashes or memory errors, the CPU was used to maximum capacity, and the job completed successfully.
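For reference, the parameters being discussed would sit in openquake.cfg roughly as follows (a hypothetical excerpt: the section name and comments are assumptions, only the parameter names and default values come from the report above):

```ini
# Hypothetical excerpt of openquake/engine/openquake.cfg (section name assumed).
# Lowering pmap_max_mb / pmap_max_gb makes the engine work on smaller
# probability-map chunks, trading speed for a smaller peak memory footprint.
[memory]
hard_mem_limit = 99   ; stop the job if system RAM usage exceeds this percent
pmap_max_mb = 50      ; per-task probability-map size limit on the workers (MB)
pmap_max_gb = 4       ; probability-map size limit on the master node (GB)
```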

Third, I looked further into the code. The path openquake/calculators/classical.py is used when you define 'classical' in job.ini. I examined lines 108-153, which read the parameters pmap_max_mb = 50 and pmap_max_gb = 4. I can clearly see that this is related to memory estimation: if the calculation would fill the memory, it raises MemoryError('You ran out of memory!'). However, the calculation does not use even one percent of the memory; it is using the CPU. Why does it crash with a low-memory error when, in reality, the load is on the CPU? This can be controlled from openquake/engine/openquake.cfg, and there are no overheating or hardware problems.
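The estimate-then-abort pattern being described can be sketched like this (all names and the size formula are assumptions for illustration; only the pmap_max_gb parameter and the MemoryError message come from the thread). The key idea is that the engine predicts the size of the probability maps from the array shape before allocating them:

```python
def estimate_pmap_gb(num_sites, num_levels, num_gsims, itemsize=8):
    """Rough size in GB of a float64 probability map of shape
    (num_sites, num_levels, num_gsims). Hypothetical helper."""
    return num_sites * num_levels * num_gsims * itemsize / 1024**3

def check_pmap_size(num_sites, num_levels, num_gsims, pmap_max_gb=4.0):
    """Abort early if the estimated pmap would exceed the configured limit."""
    est = estimate_pmap_gb(num_sites, num_levels, num_gsims)
    if est > pmap_max_gb:
        raise MemoryError(
            f'You ran out of memory! Estimated pmap size {est:.1f} GB '
            f'exceeds pmap_max_gb={pmap_max_gb}')
    return est
```

This also explains why the check can fire even when current RAM usage looks low: it is a prediction about a future allocation, not a reading of present consumption, and a dense site grid (e.g. region_grid_spacing = 5) inflates num_sites dramatically.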

Thank you in advance. Regards, Nima

mmpagani commented 4 weeks ago

Dear @NimaDolatabadi , thanks for your message. Note that we assist on a best-effort basis, hence, assistance requests written with an aggressive tone do not put us in a positive mood to help. So, unless you change the tone of your requests, I am not planning to spend time helping you. Good luck.

NimaDolatabadi commented 4 weeks ago

@mmpagani Thanks for the reply. I will revise it, though I will not forget your tone on the oq-forum https://groups.google.com/g/openquake-users/c/4flmaTuxgCA/m/_ngbaH1JAAAJ; deleting messages is not a proper way to act (I will consider that an accident). Regards, Nima

mmpagani commented 4 weeks ago

I am afraid I do not think my message was in any way offensive. The OpenQuake mailing list is moderated and users must comply with its rules. Here https://groups.google.com/g/openquake-users you can find a description of the mailing list's goals, which I copy here for you: "Welcome to the OpenQuake user group. Our aim is to provide the OpenQuake Community with a place to post questions, comments, and find solutions to issues regarding the use of OpenQuake."

NimaDolatabadi commented 4 weeks ago

@mmpagani OK. The main issue remains the memory error. Regards

micheles commented 2 weeks ago

Nima, you are not giving us the means to help you. If you don't send us the calculation you are running, how can we know where the memory issue is? Also, you are not saying how much memory you have available; we recommend 4 GB per thread. At the moment you are the only one reporting this problem, so it could be a bug specific to your calculation, or it could just be that you don't have the recommended amount of memory, in which case the only solution is to buy more memory. If you don't send us the job.zip file in the next few days, I will just close the issue, since there is nothing I can do.
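The 4 GB/thread recommendation can be turned into a quick sanity check (a sketch, not an official OpenQuake tool; the helper name is hypothetical):

```python
GB = 1024**3

def max_safe_threads(total_ram_bytes, gb_per_thread=4):
    """How many worker threads a 4 GB/thread budget allows on this machine."""
    return max(1, int(total_ram_bytes // (gb_per_thread * GB)))

# With the figures reported below (128 GB of RAM, 56 cores):
threads = max_safe_threads(128 * GB)
print(threads)  # 32 -> RAM, not CPU count, is the binding constraint here
```

In other words, a 56-core machine with 128 GB of RAM would only support about 32 threads under this budget, which is one way a job can hit a memory limit while the CPUs still look underused.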

NimaDolatabadi commented 2 weeks ago

@micheles For some reason it is not possible to share the full project, but I can share the job.ini for you: job.zip. I have 56 CPU cores and 128 GB of RAM.

Through the link given above I asked the same question https://groups.google.com/g/openquake-users/c/iXlpSM6usFY/m/xk-vCOWSAQAJ and got the answer to lower the grid resolution, but I believe this is not the right answer, because by changing some parameters I got my results. I just found out this might be a bug and wanted to be helpful in improving the behavior of the OpenQuake engine.

micheles commented 2 weeks ago

pmap_max_mb is related to the memory consumption on the workers and pmap_max_gb to the memory consumption on the master node, but without the model I cannot say where the problem is. I can mention that engine-3.20 has some memory optimizations missing in version 3.19, so perhaps if you upgrade, your calculation will work even without touching such parameters.