gem / oq-engine

OpenQuake Engine: a software for Seismic Hazard and Risk Analysis
https://github.com/gem/oq-engine/#openquake-engine
GNU Affero General Public License v3.0

PSHA outputs filled with zero/nan values on 3.16.6 LTS #9057

Open alejocaldeGEM opened 9 months ago

alejocaldeGEM commented 9 months ago

Issue

I found an issue with this particular PSHA calculation, which gives wrong outputs on engine version 3.16.6. The calculation was run on machines running Windows 10 and Windows 11.

On Windows 10, this PSHA calculation runs successfully, but some quantiles in the outputs contain only zero values (see the images provided for a 0.15 quantile hazard curve and a 0.15 quantile UHS).

On Windows 11 with the latest security updates, the calculation also runs successfully, but the outputs, including hazard maps, contain NaN values. I have confirmed this, but cannot provide screenshots or further testing because my machine runs Windows 10.

Expected behavior

Notice that this is a calculation using the EFEHR model. It is not technically correct: it uses a dummy site model and samples only 10 branches. However, regardless of the Windows version, I believe it should be able to provide the lower quantiles, even when sampling just 2 branches. It seems to be an issue with how the outputs are formed, but I have not investigated further.

Since this is the LTS, the issue manifests differently on different OS versions, and the PSHA model is a 'tricky' one, I think it is worth investigating further. Let me know if I can help with anything.

Thank you,

Alejandro

[Attachments: two images (0.15 quantile hazard curve, 0.15 quantile UHS), PSHA.zip]

micheles commented 9 months ago

Getting NaNs looks like a bug, but getting the zeros looks perfectly correct, given the definition we use for quantiles. It is the same on Linux with the current master. The issue is that you are sampling 10 realizations and the first 4 realizations have exactly zero hazard curves. Those zeros cause the lower quantiles to be zero. For instance:

>>> from openquake.hazardlib.stats import quantile_curve
>>> quantile_curve(.15, [0, 0, 0, 0, .1, .2, .3, .4, .5, .6])
array(0.)

Now one should investigate why the first 4 realizations are exactly zeros, but that looks more like a scientific question than a bug. My guess is that you are in Croatia and investigating branches of the source model logic tree corresponding to Iceland and thus having zero effect.
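The effect of the zero realizations can be reproduced without the engine. The following is a minimal sketch, assuming equally weighted realizations and a quantile defined by interpolating between the midpoints of the cumulative weights; the function `weighted_quantile` is hypothetical and written only for illustration, and is not necessarily the definition used by `quantile_curve`.

```python
def weighted_quantile(q, values, weights=None):
    """Illustrative weighted quantile (an assumption, NOT the engine's code)."""
    n = len(values)
    if weights is None:
        # Assumption: all realizations carry the same weight
        weights = [1.0 / n] * n
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    vals = [v for v, _ in pairs]
    # Place each sorted value at the midpoint of its cumulative-weight interval
    positions = []
    cum = 0.0
    for _, w in pairs:
        positions.append((cum + 0.5 * w) / total)
        cum += w
    if q <= positions[0]:
        return vals[0]
    if q >= positions[-1]:
        return vals[-1]
    # Linear interpolation between the two neighbouring positions
    for i in range(1, n):
        if q <= positions[i]:
            frac = (q - positions[i - 1]) / (positions[i] - positions[i - 1])
            return vals[i - 1] + frac * (vals[i] - vals[i - 1])


# Ten sampled realizations, the first four of which are exactly zero
curves = [0, 0, 0, 0, .1, .2, .3, .4, .5, .6]
low = weighted_quantile(0.15, curves)   # falls among the zero realizations
mid = weighted_quantile(0.50, curves)   # falls among the nonzero ones
```

With four zeros out of ten equally weighted values, the 0.15 quantile sits entirely within the zero realizations, so it is zero, while the median is not.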

alejocaldeGEM commented 9 months ago

Thank you, Michele. Pretty clear. Regarding the NaNs on Windows 11, I have asked the users who found the issue to send me the specifications of their machines so that we can test further internally.

micheles commented 9 months ago

Rather than the full model, one should use a reduced EUR model suitable for Croatia. Also, notice that you are 10 times slower than you should be because you are missing the CRUCIAL parameter ps_grid_spacing=50 in the job.ini.
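A quick way to check whether a job.ini sets this parameter is sketched below. It assumes the file can be read as a standard INI file with Python's `configparser`; the helper name `missing_params`, the section names, and the sample file contents are hypothetical and serve only as illustration.

```python
import configparser


def missing_params(ini_text, required=("ps_grid_spacing",)):
    """Return the required parameter names absent from every section."""
    # interpolation=None so that '%' characters in values are left untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    present = set()
    for section in parser.sections():
        present.update(parser[section].keys())
    return [name for name in required if name not in present]


# Hypothetical job.ini contents (section names are assumptions)
JOB_INI = """
[general]
description = PSHA test
calculation_mode = classical

[calculation]
number_of_logic_tree_samples = 10
"""

# The same file with the parameter from the comment above appended
FIXED_INI = JOB_INI + "ps_grid_spacing = 50\n"

print(missing_params(JOB_INI))
print(missing_params(FIXED_INI))
```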

alejocaldeGEM commented 9 months ago

Yes, I am fully aware that using the full model is not recommended. Unfortunately, when we reduced the model, the risk was significantly underestimated (i.e. Full model AALR = 0.14%, Reduced model AALR = 0.01%). This led me to believe that the technique used to reduce the model was introducing some problems in the event-based calculations. Hence, we had to go back to the full model as attached (happy to discuss this further in another thread, as I have all the input files for that case too).

Coming back to the issue on Windows 11, here are the specifications of the machine and the OS where we experienced the issue:

Device specifications

Device name: MAJA
Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz 2.59 GHz
Installed RAM: 32,0 GB (31,8 GB usable)
Device ID: 0AA9C217-0367-408B-86B2-C2C753EB3197
Product ID: 00330-52641-62825-AAOEM
System type: 64-bit operating system, x64-based processor
Pen and touch: No pen or touch input is available for this display

Windows specifications:

Edition: Windows 11 Pro
Version: 22H2
Installed on: 31.3.2023
OS build: 22621.2283
Experience: Windows Feature Experience Pack 1000.22662.1000.0