Closed observingClouds closed 1 year ago
Okay, upon further investigation the "issue" is with _get_regionprops. I put "issue" in quotes because you actually cache the regionprops. Could this caching maybe be made optional? For large datasets it is just impractical to keep all these fields in memory.
Thanks for pointing this one out! We put that caching in for cases where you want to calculate multiple object-based metrics on the same cloud field, so you don't have to recompute the region properties for each metric. Two solutions I'd be happy with:
_get_regionprops is usually an order of magnitude faster than computing the metric itself, so I don't think we'd lose that much performance (I don't quite know how it will scale, however).
regions for a unique object_labels, instead of continuously expanding a _CAHCED_VALUES dict) which keeps the performance, but builds in the assumption that your workflow would first have an outer loop over scenes and then an inner loop over metrics; otherwise you would recompute the region properties for each image for each metric again.
Option 2 is easily implemented, I think. What do you think, @leifdenby?
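To make the idea of keeping only the latest regions concrete, here is a minimal sketch. It is not the cloudmetrics implementation: LatestRegionsCache, compute_regions and the fingerprinting scheme are hypothetical names invented for illustration, and compute_regions is a cheap stand-in (pixel count per label) for the real region-property computation.

```python
import hashlib


def compute_regions(object_labels):
    """Stand-in for the expensive step: pixel count per non-zero label."""
    areas = {}
    for row in object_labels:
        for label in row:
            if label != 0:
                areas[label] = areas.get(label, 0) + 1
    return areas


class LatestRegionsCache:
    """Hypothetical single-slot cache: remembers only the most recent label field."""

    def __init__(self):
        self._key = None
        self._regions = None
        self.n_computations = 0  # counts cache misses, for inspection only

    @staticmethod
    def _fingerprint(object_labels):
        # Hash the contents so that equal label fields map to the same key.
        digest = hashlib.sha1()
        for row in object_labels:
            digest.update(str(list(row)).encode("ascii"))
        return digest.hexdigest()

    def get(self, object_labels):
        key = self._fingerprint(object_labels)
        if key != self._key:
            # New scene: overwrite the single slot instead of growing a dict.
            self._regions = compute_regions(object_labels)
            self._key = key
            self.n_computations += 1
        return self._regions

    def clear(self):
        """Manually drop the cached entry so its memory can be released."""
        self._key = None
        self._regions = None


cache = LatestRegionsCache()
scene_a = [[0, 1, 1], [0, 0, 2]]
scene_b = [[3, 3, 0], [0, 0, 0]]
regions_a = cache.get(scene_a)   # computed
regions_a2 = cache.get(scene_a)  # served from the single slot
regions_b = cache.get(scene_b)   # replaces scene_a's entry
```

With an outer loop over scenes and an inner loop over metrics, each scene is computed once. Swapping the loop order makes every call a miss, which is the trade-off described above.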
Thanks for your response. I like option 2, maybe implemented as a manual reset of the cache whenever the user wants to.
Great discussion! Sorry for being slow. I like @martinjanssens's PR to just remove this caching "feature" for now since it's more of a hindrance than a benefit at the moment. I suggest we merge that PR, close this issue and create a new issue to track the idea of maybe re-introducing caching at a later date (to properly do that we probably want a test benchmark that duplicates the issues @observingClouds found, so that we can ensure we actually get a speedup while not breaking things in future). What do you both think?
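A benchmark of that kind could start from something like the sketch below. It is only an assumption about how such a test might look: peak_memory and compute_regions are hypothetical helpers, the "regions" are a plain copy of the field rather than real region properties, and memory is measured with the standard-library tracemalloc module.

```python
import tracemalloc


def compute_regions(field):
    """Stand-in result that is as large as the input field (a per-row copy)."""
    return [list(row) for row in field]


def peak_memory(n_scenes, size, use_growing_cache):
    """Return the peak traced memory in bytes while processing n_scenes fields."""
    cache = {}
    tracemalloc.start()
    for i in range(n_scenes):
        # Synthetic size x size "label field"; values are arbitrary.
        field = [[(i + r + c) % 7 for c in range(size)] for r in range(size)]
        regions = compute_regions(field)
        if use_growing_cache:
            # Mimics a dict that keeps every scene's result alive.
            cache[i] = regions
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


peak_with_cache = peak_memory(n_scenes=50, size=64, use_growing_cache=True)
peak_without_cache = peak_memory(n_scenes=50, size=64, use_growing_cache=False)
print(f"growing cache: {peak_with_cache} B, no cache: {peak_without_cache} B")
```

A real benchmark would call the actual metric functions on realistic cloud masks and also record runtime, so that any re-introduced caching can be checked for both speedup and memory growth.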
Sounds good to me!
And I agree too. So let's close this with the merging of #73.
There seems to be an issue with the (timely) release of allocated memory. The following calculation rapidly increases memory usage in a Jupyter Notebook environment until the kernel eventually dies from running out of memory.
I observe that the memory usage also depends on the field size, but is even larger than the field itself. This issue does not occur for some other tested metrics (cloud_fraction, label_objects), but it does seem to occur for num_objects as well.