Performance of open_sky function

observingClouds commented 1 year ago

I came across #68 and optimised the open_sky function for my own purposes and thought I share it here.

What would be left to do to before even considering this PR:

[x] testing functionality
[x] testing performance
[x] add numba to requirements.txt
[ ] find a way to add back the unsupported debug_plot
[ ] find a way to add back the NotImplementedError

observingClouds commented 1 year ago

I experimented with numba but it actually did not improve the performance. The refactoring of the function did however increase the performance by about 19000 times. I tested it for a random field of 100x1000 and got 1.05s and 54µs. I guess that's a win 🤣 Numba also does not improve the performance for larger fields (e.g. 10000x1000).

import numpy as np
import cloud metrics
EXAMPLE_MASK=np.random.randint(0,1,(100,1000))
%timeit cloudmetrics.mask.open_sky(EXAMPLE_MASK)
# 54 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Okay, if I use a field with 0s and 1s I only get a factor of 2 🤣

import numpy as np
import cloud metrics
EXAMPLE_MASK=np.random.randint(0,2,(1000,1000))
%timeit cloudmetrics.mask.open_sky(EXAMPLE_MASK)
# 5.45 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note (to myself): Performance between numba and non-numba optimised version can be tested by setting the environment variable NUMBA_DISABLE_JIT. Testing both versions at the same time without restarting the python interpreter was not successful with the following script:

from numba.tests.support import override_config
with override_config('DISABLE_JIT',True):
    import cloudmetrics
    %timeit cloudmetrics.mask.open_sky(EXAMPLE_MASK)
    sys.modules.pop("cloudmetrics")
    del cloudmetrics

observingClouds commented 1 year ago

@martinjanssens @leifdenby please have a look. If you do not mind that I removed the debug_plot functionality and the NotImplementedError, then this is ready to be merged.

observingClouds commented 1 year ago

To be honest I'm not 100% sure but I think it is the replacement of some of the np.where functions.

cloudsci / cloudmetrics

Performance of open_sky function #72