peanutfun opened this issue 2 months ago
Interesting! Just a few quick points of feedback:

Why would you use the line `hazard.select(event_id=[event_id])`? I think this is very suboptimal, as `ImpactCalc.impact` handles multiple events efficiently. You might already regain 5 of the total 25 seconds by not doing this. A sketch of the difference follows.
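A minimal sketch of the two patterns, with placeholder names (`exposure`, `impf_set`, and `hazard` are assumed to be a CLIMADA `Exposures`, `ImpactFuncSet`, and `Hazard`):

```python
from climada.engine import ImpactCalc

# Pattern 1: slice the hazard and call ImpactCalc once per event.
# Every iteration pays the per-call setup cost (exposure slicing,
# intensity matrix slicing) again.
impacts = [
    ImpactCalc(exposure, impf_set, hazard.select(event_id=[eid])).impact()
    for eid in hazard.event_id
]

# Pattern 2: one call over the full event set; ImpactCalc handles the
# event dimension internally, so the setup cost is paid only once.
impact = ImpactCalc(exposure, impf_set, hazard).impact()
```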
The `minimal_exp_gdf` is central for the case when the exposure is larger than the hazard. Regarding "This slicing takes significantly longer than the actual data multiplication": if I remember correctly, this is only the case for small datasets, but I am happy to improve that.
In brief, the current setup was obtained after trying to optimize computation times for a large number of different use cases (small/large exposures, small/large hazards). But there is certainly room for improvement! Maybe one could make the slicing inside of `ImpactCalc` optional? Or, what I actually had in mind while writing the module: adding other types of computation methods over time, here something like `ImpactCalc.multi_exposures_impact` (see the interface sketch below).
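A purely hypothetical interface sketch of such a method (`multi_exposures_impact` does not exist in CLIMADA; the body below just spells out the naive loop, whereas a built-in version could share the sliced intensity matrix and skip the repeated per-call setup):

```python
from climada.engine import ImpactCalc

def multi_exposures_impact(exposures_per_event, impf_set, hazard):
    """Hypothetical: compute one impact per (exposure, event) pair.

    exposures_per_event: dict mapping a hazard event_id to an Exposures
    object. A real implementation inside ImpactCalc could reuse
    intermediate results across iterations instead of recomputing them
    as this naive loop does.
    """
    return [
        ImpactCalc(exp, impf_set, hazard.select(event_id=[eid])).impact()
        for eid, exp in exposures_per_event.items()
    ]
```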
@chahank Could you supply the Jupyter script and data you used for benchmarking `ImpactCalc`?

It is a bit messy, so let me clean it up and send it directly.
TLDR: Calling `ImpactCalc.impact` multiple times and concatenating the results is much slower than calling it once on a larger event set. This has some unfortunate implications for use cases where the exposure or the impact function changes between hazard events.

Problem
In my use case, I have a hazard event set with one event per year (flood footprints). I also have one exposure layer per year (WorldPop data). To compute an impact over all years, I must call `ImpactCalc` independently for each exposure layer, because it does not support multiple exposures. I then concatenate the results into a single impact. This was my general idea of how to implement it:
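(A sketch of that idea; `exposures_per_year`, `impf_set`, and `hazard` are placeholders for the actual data, and the hazard is assumed to carry one event per year, named by year.)

```python
from climada.engine import ImpactCalc

impacts = []
for year, exposure in exposures_per_year.items():
    # One hazard event per year: select the matching footprint
    haz_year = hazard.select(event_names=[str(year)])
    impacts.append(
        ImpactCalc(exposure, impf_set, haz_year).impact(save_mat=False)
    )

# Finally, concatenate the per-year impacts into a single Impact object,
# e.g. by stacking `at_event` and the event metadata.
```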
I was quite bummed to see that in this case, the impact calculation took longer by a factor of more than 5 compared to an impact calculation with a single exposure and multiple hazard events. Note that a factor of 5 might mean 5 s vs. 25 s for a single impact calculation, but 5 min vs. 25 min for a calibration.
Profiling results
I created a stub test case. It does not use real data, and in this case the slowdown was "only" a factor of 2, see my code below. I made sure that I assigned centroids beforehand, did not save the impact matrix, and only profiled the actual impact calculation. The profiles are visualized with snakeviz.
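(For reference, one way to produce such a profile, assuming the profiled impact calculation is wrapped in a function `run_impact()`; the viewer is then started with `snakeviz impact.prof`:)

```python
import cProfile

# Profile only the actual impact calculation and dump the stats to disk;
# inspect the result with: snakeviz impact.prof
cProfile.run("run_impact()", "impact.prof")
```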
Single exposure: [snakeviz profile screenshot]

Multi-exposure: [snakeviz profile screenshot]
Results: the overhead comes from
- `imp_mat_gen`,
- calling `Impact.from_eih` multiple times, and
- calling `ImpactCalc.minimal_exp_gdf` multiple times.

Discussion points
- Calling `ImpactCalc` multiple times has a strong performance penalty.
- `Impact.from_eih` spends a lot of time extracting and stacking the exposure coordinates. It could simply store the exposure or its geometry column instead of copying the coordinates.
- Why do we compute `minimal_exp_gdf` instead of just using the full exposure gdf?
- In `Hazard.select` and `Hazard.get_mdr`, the intensity matrix is sliced both row- and column-wise. This slicing takes significantly longer than the actual data multiplication. While it is nice that the multiplication is fast with the `csr_matrix`, column-wise slicing is explicitly slow for this data type. Are we using the most efficient setup here? (A small illustration follows this list.)
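To illustrate the last point with a toy example (mine, not from the profiled code): fancy column indexing on a `csr_matrix` is much more expensive than row indexing, because CSR stores the data row by row:

```python
import time
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Random sparse matrix in CSR format, roughly event x centroid shaped
mat = sparse.random(10_000, 100_000, density=0.01, format="csr", random_state=0)
rows = rng.choice(mat.shape[0], size=500, replace=False)
cols = rng.choice(mat.shape[1], size=5_000, replace=False)

t0 = time.perf_counter()
_ = mat[rows, :]  # row slicing: cheap on CSR
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
_ = mat[:, cols]  # column slicing: expensive on CSR
t_cols = time.perf_counter() - t0

print(f"rows: {t_rows:.4f} s, cols: {t_cols:.4f} s")
```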
Code

https://polybox.ethz.ch/index.php/s/y2aqu3JX4w6qgfy