Make caching of regionprops optional

observingClouds commented 1 year ago

There seems to be an issue with the (timely) release of memory allocation. The following calculation increases rapidly the used memory in an Jupyter Notebook environment until the kernel eventually dies due to overuse of memory.

import numpy as np
import cloudmetrics

ds_test = np.random.choice([True,False],(100,100))

for i in tqdm.tqdm(range(1000)):
    _ = cloudmetrics.mask.iorg_objects(ds_test, periodic_domain=False, reference_dist='poisson')

I observe that the memory usage also depends on the field size, but is even larger than the field itself. This issue does not occur for some other tested metrics (cloud_fraction, label_objects), but it does seem to occur for num_objects as well.

observingClouds commented 1 year ago

Okay, up on further investigation the "issue" is with _get_regionprops. "issue", because you actually cache the regionprops. Could this caching maybe made optional? For large datasets it is just impractical to keep all these fields in memory.

martinjanssens commented 1 year ago

Thanks for pointing this one out! We put that caching in for cases where you want to calculate multiple object-based metrics on the same cloud field, so you don't have to recompute the region properties for each metric. Two solutions I'd be happy with:

We make the caching optional, as you say, or remove it entirely. _get_regionprops is usually an order of magnitude faster than computing the metric itself, so I don't think we'd lose that much performance (I don't quite know how it will scale, however).
We build in a mechanism where the cache is released when you move on to a "next image" (i.e. just cache regions for a unique object_labels, instead of continuously expanding a _CAHCED_VALUES dict) which keeps the performance, but builds in the assumption that your workflow would first have an outer loop over scenes and then an inner loop over metrics, otherwise you would recompute the region properties for each image for each metric again.

Option 2 is easily implemented, I think. What do you think, @leifdenby ?

observingClouds commented 1 year ago

Thanks for your response. I like option 2, maybe implemented as a manual reset of the cache whenever the user wants to.

leifdenby commented 1 year ago

Great discussion! Sorry for being slow. I like @martinjanssens's PR to just remove this caching "feature" for now since it's more of a hindrance than a benefit at the moment. I suggest we merge that PR, close this issue and create a new issue to track the idea of maybe re-introducing caching at a later date (to properly do that we probably want a test benchmark that duplicates the issues @observingClouds found, so that we can ensure we actually get a speedup while not breaking things in future). What do you both think?

observingClouds commented 1 year ago

Sounds good to me!

martinjanssens commented 1 year ago

And I agree too. So let's close this with the merging of #73.

cloudsci / cloudmetrics

Make caching of regionprops optional #70