sometimes disaggregation is much slower if epsilon_star = True

For disaggregation of single sites by MDE, the calculation is sometimes significantly slower with `epsilon_star=True`. The table below shows the time required to run the same job on three different computers and oq versions, with and without `epsilon_star`. Where not all sources were used, the percentage of sources was controlled with `OQ_SOURCES_SAMPLE=X oq engine --run ...`. The reported times are those printed in the console, but the reports confirm that the difference arises in the disaggregation step.

When the job is reduced, the overall time difference becomes smaller, but it remains clearly visible in the `total compute_disagg` row. For example, in the smallest test case:

**epsilon_star = True**
+------------------------------+-----------+-----------+--------+
| calc_5553, maxmem=0.7 GB     | time_sec  | memory_mb | counts |
+------------------------------+-----------+-----------+--------+
| DisaggregationCalculator.run | 37.7      | 336.9     | 1      |
+------------------------------+-----------+-----------+--------+
| ClassicalCalculator.run      | 29.9      | 356.9     | 1      |
+------------------------------+-----------+-----------+--------+
...
+------------------------------+-----------+-----------+--------+
| total compute_disagg         | 9.47083   | 1.38086   | 15     |

**without epsilon_star**
+------------------------------+-----------+-----------+--------+
| calc_5552, maxmem=0.6 GB     | time_sec  | memory_mb | counts |
+------------------------------+-----------+-----------+--------+
| DisaggregationCalculator.run | 32.5      | 285.8     | 1      |
+------------------------------+-----------+-----------+--------+
| ClassicalCalculator.run      | 31.2      | 305.5     | 1      |
+------------------------------+-----------+-----------+--------+
... 
+------------------------------+-----------+-----------+--------+
| total compute_disagg         | 1.37318   | 1.20703   | 15     |
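To put a number on the gap, here is a small sketch that only does arithmetic on the two `total compute_disagg` timings quoted above. The variable names are mine and purely illustrative; nothing here calls the engine.

```python
# Timings (seconds) copied from the "total compute_disagg" rows above.
compute_disagg_with = 9.47083      # calc_5553, epsilon_star = True
compute_disagg_without = 1.37318   # calc_5552, without epsilon_star

# Relative and absolute cost of enabling epsilon_star in this test case.
slowdown = compute_disagg_with / compute_disagg_without
extra_seconds = compute_disagg_with - compute_disagg_without

print(f"compute_disagg slowdown: {slowdown:.1f}x")
print(f"extra time in compute_disagg: {extra_seconds:.2f} s")
```

For this smallest case, `total compute_disagg` is roughly 6.9 times slower with `epsilon_star`, about 8.1 s extra summed over the 15 tasks.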

I will share job files separately.

Here is how to profile a disaggregation calculation.

1. Run `oq run job.ini -p calculation_mode=classical` and write down the calculation ID (say it is 1234).
2. Run `oq run job.ini -p epsilon_star=true --hc 1234 -c0 -s100 > true.txt`
3. Run `oq run job.ini -p epsilon_star=false --hc 1234 -c0 -s100 > false.txt`
4. Compare the profiler information in `true.txt` and `false.txt`.

Doing that, I see that all the time is spent in the scipy function `_truncnorm_sf_scalar`. It could be that `_truncnorm_sf_scalar` has special optimizations for macOS. Perhaps we could rewrite `_truncnorm_sf_scalar` using numba (we have a similar-looking function called `truncnorm_sf` in https://github.com/gem/oq-engine/blob/master/openquake/hazardlib/stats.py#L38), ~but frankly it is a lot of work~. I would do nothing for the moment, but I am pretty sure I can do better than scipy: I looked at the code, and that part is pure Python with no vectorization at all.
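As a rough illustration of what a vectorized replacement could look like, here is a minimal sketch of the survival function of a standard normal truncated to `[a, b]`, written with NumPy and `scipy.special.ndtr` and checked against `scipy.stats.truncnorm.sf`. The name `truncnorm_sf_vectorized` and the test values are hypothetical; this is not the engine's `truncnorm_sf` and not a numba implementation. It uses the plain formula (Φ(b) − Φ(x)) / (Φ(b) − Φ(a)), so I would expect it to lose accuracy far out in the tails, where a log-space evaluation would be needed.

```python
import numpy as np
from scipy.special import ndtr        # standard normal CDF, vectorized
from scipy.stats import truncnorm     # reference implementation


def truncnorm_sf_vectorized(x, a, b):
    """Survival function of a standard normal truncated to [a, b].

    Hypothetical helper for illustration only: evaluates
    (Phi(b) - Phi(x)) / (Phi(b) - Phi(a)) for a whole array at once.
    """
    x = np.asarray(x, dtype=float)
    phi_a, phi_b = ndtr(a), ndtr(b)
    sf = (phi_b - ndtr(x)) / (phi_b - phi_a)
    # Values below a give a ratio > 1 and values above b give a ratio < 0;
    # clipping maps them to the correct limits 1 and 0.
    return np.clip(sf, 0.0, 1.0)


# Compare with scipy on a few points inside and outside the truncation range.
x = np.linspace(-4.0, 4.0, 9)
ours = truncnorm_sf_vectorized(x, -3.0, 3.0)
ref = truncnorm.sf(x, -3.0, 3.0)
max_abs_diff = float(np.max(np.abs(ours - ref)))
print("max abs difference vs scipy:", max_abs_diff)
```

If something along these lines agrees with scipy over the epsilon range the disaggregation actually uses, it could be timed against `_truncnorm_sf_scalar` before deciding whether a numba version is worth the effort.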

sometimes disaggregation is much slower if epsilon_star = True #8346