The trick to make this work is to store only the highest losses per asset, i.e. the tail of the distribution instead of the full distribution.
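A minimal numpy sketch of the idea, keeping only the K highest losses per asset (the array shape and the tail size K are assumptions for illustration, not engine code):

```python
import numpy as np

def keep_tail(losses, K):
    """Keep only the K highest losses per asset.

    `losses` is assumed to have shape (num_events, num_assets);
    the result has shape (K, num_assets), sorted in decreasing
    order along the event axis.
    """
    # np.partition moves the K largest values of each column into
    # the last K rows without fully sorting the array
    part = np.partition(losses, len(losses) - K, axis=0)[-K:]
    return np.sort(part, axis=0)[::-1]  # decreasing order

# example: 10,000 events, 3 assets, keep the 100 largest losses each
losses = np.random.default_rng(42).lognormal(size=(10_000, 3))
tail = keep_tail(losses, K=100)  # shape (100, 3)
```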
Can we consider adding pandas as a dependency? Pandas provides functionality that could be very useful for dealing with large damage tables and loss tables; from the package overview:
> Pandas also provides data structures for efficiently storing sparse data.
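For instance (a small illustration of the sparse structures that sentence refers to; the sizes are made up), a mostly-zero loss column can be stored without materializing the zeros:

```python
import numpy as np
import pandas as pd

dense = np.zeros(1_000_000)
dense[::10_000] = 1.5  # only 100 nonzero losses
# store only the nonzero entries and their positions
sparse = pd.Series(pd.arrays.SparseArray(dense, fill_value=0.0), name="loss")
print(sparse.memory_usage(deep=True))  # a tiny fraction of the ~8 MB dense array
```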
I actually support this, independently of the event_based_damage project: it makes sense for us to have a "batteries included" OpenQuake distribution.
+1 to this. Several additional toolkits building on OQ use Pandas, so there are few situations where one would use an OQ distribution without Pandas. Any possibility there could be speed-ups in, for example, the site collection object if it worked as a Pandas DataFrame?
I sincerely doubt that using Pandas will speed up the site collection. Providing Pandas will make it easier to explore the datastore (see https://github.com/gem/oq-engine/pull/5357) and to perform post-processing analyses, but for the time being I would refrain from using it in the core engine: I do not trust it as much as I trust numpy, especially performance-wise.
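For the record, the datastore is an HDF5 file, so post-processing with Pandas can be as simple as the sketch below (the file name and the `avg_losses` dataset are hypothetical placeholders, not necessarily the real dataset names):

```python
import h5py
import pandas as pd

# open a hypothetical calculation datastore and load one dataset;
# a structured numpy array converts directly into a DataFrame
with h5py.File("calc_1.hdf5", "r") as ds:
    records = ds["avg_losses"][()]  # assumed to be a structured array
df = pd.DataFrame.from_records(records)
print(df.describe())
```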
Probably not for event_based, but certainly yes for scenario_risk, and it is worth testing the feasibility of the approach.
PS: after 11 days of work, it turns out that the approach is viable even for ebrisk, provided we discard the smallest losses, which affect only the low-return-period portion of the loss curves.
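A short worked example of why discarding the smallest losses only touches the low-return-period end of the curve, using the usual empirical estimator return_period = eff_time / rank (the numbers here are invented):

```python
import numpy as np

def loss_curve(losses, eff_time):
    """Empirical loss curve: losses in decreasing order, paired
    with their return periods eff_time / rank."""
    sorted_losses = np.sort(losses)[::-1]
    ranks = np.arange(1, len(sorted_losses) + 1)
    return eff_time / ranks, sorted_losses

rng = np.random.default_rng(0)
losses = rng.lognormal(size=10_000)
eff_time = 10_000  # hypothetical effective investigation time in years

periods, curve = loss_curve(losses, eff_time)
# keep only the 1,000 highest losses: the curve is identical for
# return periods >= eff_time / 1000 = 10 years
tail = np.sort(losses)[-1000:]
periods_tail, curve_tail = loss_curve(tail, eff_time)
assert np.allclose(curve[:1000], curve_tail)
```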