CLIMADA-project / climada_python

Python (3.8+) version of CLIMADA
GNU General Public License v3.0
316 stars 123 forks source link

Hazard.intensity in CSR format and not CSC format #938

Open mncosta opened 2 months ago

mncosta commented 2 months ago

Hi,

Not sure if this belongs here, but I have a question about Climada's performance and either I am missing something or I believe an improving could be made to the use of hazard intensities. Searching through the repositories' issues I only found a reference to this on #874, stating A significant amount of time is spent slicing the hazard intensity matrix, which is related to what I found.

As I found, currently Hazard.intensity is stored as a "row vector" spicy.sparse matrix in CSR format. I'm making this assumption from the provided example in here, specifically the following code section in Part 1:

%matplotlib inline
import numpy as np
from climada.hazard import Hazard
from climada.util.constants import HAZ_DEMO_FL
# to hide the warnings
import warnings
warnings.filterwarnings('ignore')

# read intensity from raster file HAZ_DEMO_FL and set frequency for the contained event
haz_ven = Hazard.from_raster([HAZ_DEMO_FL], attrs={'frequency':np.ones(1)/2}, haz_type='FL')
haz_ven.check()

Inspecting haz_ven.intensityyields the following format:

<1x1032226 sparse matrix of type '<class 'numpy.float32'>'
    with 161671 stored elements in Compressed Sparse Row format>

Yet, this appears to be a 1 row x 1032226 columns matrix, in essence making the CSR format not the most efficient when slicing/indexing is needed. Why are hazard intensities saved in CSR format and not CSC format? In essence this would making slicing much more efficient. Time-wise would be a massive improvement if hazard matrices are huge. Again, with the above hazard example and with a simple indexing in CSR vs. CSC format:

# taking some random indexes to slice
idx = np.random.randint(0, 1032226, 100)

# get hazard.intensity in csc format
csc_format = haz_ven.intensity.tocsc()
# profiling slicing
%timeit csc_format[:, idx]
%timeit haz_ven.intensity[:, idx]

Which yields:

58.3 µs ± 71.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
711 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is ~12 times faster for this very small problem. In the "real" problem I'm currently working I can get even a bigger speed up with a matrix of shape 1x1072057600, which takes about ~4.5s to index ~20k cells, whereas with CSC format I can slice in about 400 µs.

Thus, why are intensities stored in this manner? Do you require them to be this way for other calculations? Is there a faster way to slice intensities given an array of centroids that I'm missing?

chahank commented 2 months ago

Dear Miguel,

CLIMADA is originally built to be able to handle large amounts of events (think of large probabilistic sets). Thus, slicing over events is important in that setting. Only recently have applications with a focus on singular large events emerged for which a centroid slicing might be more important.