cloudsci / cloudmetrics

Toolkit for computing 15+ metrics characterising 2D cloud patterns
16 stars 8 forks source link

Make caching of regionprops optional #70

Closed observingClouds closed 1 year ago

observingClouds commented 1 year ago

There seems to be an issue with the (timely) release of memory allocation. The following calculation increases rapidly the used memory in an Jupyter Notebook environment until the kernel eventually dies due to overuse of memory.

import numpy as np
import cloudmetrics

ds_test = np.random.choice([True,False],(100,100))

for i in tqdm.tqdm(range(1000)):
    _ = cloudmetrics.mask.iorg_objects(ds_test, periodic_domain=False, reference_dist='poisson')

I observe that the memory usage also depends on the field size, but is even larger than the field itself. This issue does not occur for some other tested metrics (cloud_fraction, label_objects), but it does seem to occur for num_objects as well.

observingClouds commented 1 year ago

Okay, up on further investigation the "issue" is with _get_regionprops. "issue", because you actually cache the regionprops. Could this caching maybe made optional? For large datasets it is just impractical to keep all these fields in memory.

martinjanssens commented 1 year ago

Thanks for pointing this one out! We put that caching in for cases where you want to calculate multiple object-based metrics on the same cloud field, so you don't have to recompute the region properties for each metric. Two solutions I'd be happy with:

Option 2 is easily implemented, I think. What do you think, @leifdenby ?

observingClouds commented 1 year ago

Thanks for your response. I like option 2, maybe implemented as a manual reset of the cache whenever the user wants to.

leifdenby commented 1 year ago

Great discussion! Sorry for being slow. I like @martinjanssens's PR to just remove this caching "feature" for now since it's more of a hindrance than a benefit at the moment. I suggest we merge that PR, close this issue and create a new issue to track the idea of maybe re-introducing caching at a later date (to properly do that we probably want a test benchmark that duplicates the issues @observingClouds found, so that we can ensure we actually get a speedup while not breaking things in future). What do you both think?

observingClouds commented 1 year ago

Sounds good to me!

martinjanssens commented 1 year ago

And I agree too. So let's close this with the merging of #73.