The trick to make this work is to store only the highest losses per asset, i.e. the tail of the distribution instead of the full distribution.
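A minimal numpy sketch of the idea, keeping only the K highest losses per asset (the array shape and the tail size K are assumptions for illustration, not engine code):

```python
import numpy as np

def keep_tail(losses, K):
    """Keep only the K highest losses per asset.

    `losses` is assumed to have shape (num_events, num_assets);
    the result has shape (K, num_assets), sorted in decreasing
    order along the event axis.
    """
    # np.partition moves the K largest values of each column into
    # the last K rows without fully sorting the array
    part = np.partition(losses, len(losses) - K, axis=0)[-K:]
    return np.sort(part, axis=0)[::-1]  # decreasing order

# example: 10,000 events, 3 assets, keep the 100 largest losses each
losses = np.random.default_rng(42).lognormal(size=(10_000, 3))
tail = keep_tail(losses, K=100)  # shape (100, 3)
```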
Can we consider adding pandas as a dependency? Pandas provides functionality that could be very useful for dealing with large damage tables and loss tables; from the package overview:
> Pandas also provides data structures for efficiently storing sparse data.
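For instance (a small illustration of the sparse structures that sentence refers to; the sizes are made up), a mostly-zero loss column can be stored without materializing the zeros:

```python
import numpy as np
import pandas as pd

dense = np.zeros(1_000_000)
dense[::10_000] = 1.5  # only 100 nonzero losses
# store only the nonzero entries and their positions
sparse = pd.Series(pd.arrays.SparseArray(dense, fill_value=0.0), name="loss")
print(sparse.memory_usage(deep=True))  # a tiny fraction of the ~8 MB dense array
```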
I actually support this, independently of the event_based_damage project: it makes sense for us to have a "batteries included" OpenQuake distribution.
+1 to this. Several additional toolkits building on OQ use Pandas, so there are few situations where one would use an OQ distribution without Pandas. Any possibility there could be speed-ups in, for example, the site collection object if it worked as a Pandas DataFrame?
I sincerely doubt that using Pandas will speed up the site collection. Providing Pandas will make it easier to explore the datastore (see https://github.com/gem/oq-engine/pull/5357) and to perform post-processing analyses, but for the time being I would refrain from using it in the core engine: I do not trust it as much as I trust numpy, especially performance-wise.
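For the record, the datastore is an HDF5 file, so post-processing with Pandas can be as simple as the sketch below (the file name and the `avg_losses` dataset are hypothetical placeholders, not necessarily the real dataset names):

```python
import h5py
import pandas as pd

# open a hypothetical calculation datastore and load one dataset;
# a structured numpy array converts directly into a DataFrame
with h5py.File("calc_1.hdf5", "r") as ds:
    records = ds["avg_losses"][()]  # assumed to be a structured array
df = pd.DataFrame.from_records(records)
print(df.describe())
```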
Probably not for event_based, but certainly yes for scenario_risk, and it is worth testing the feasibility of the approach.
PS: after 11 days of work, it turns out that the approach is viable even for ebrisk, provided we discard the smallest losses, which affect only the low-return-period portion of the loss curves.
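A short worked example of why discarding the smallest losses only touches the low-return-period end of the curve, using the usual empirical estimator return_period = eff_time / rank (the numbers here are invented):

```python
import numpy as np

def loss_curve(losses, eff_time):
    """Empirical loss curve: losses in decreasing order, paired
    with their return periods eff_time / rank."""
    sorted_losses = np.sort(losses)[::-1]
    ranks = np.arange(1, len(sorted_losses) + 1)
    return eff_time / ranks, sorted_losses

rng = np.random.default_rng(0)
losses = rng.lognormal(size=10_000)
eff_time = 10_000  # hypothetical effective investigation time in years

periods, curve = loss_curve(losses, eff_time)
# keep only the 1,000 highest losses: the curve is identical for
# return periods >= eff_time / 1000 = 10 years
tail = np.sort(losses)[-1000:]
periods_tail, curve_tail = loss_curve(tail, eff_time)
assert np.allclose(curve[:1000], curve_tail)
```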