cffdrs / cffdrs_py

Python Module for CFFDRS
GNU General Public License v2.0

Implement BUI raster calculation #5

Closed: conbrad closed this 2 months ago

conbrad commented 3 months ago

This PR introduces an implementation of the BUI raster calculation, leveraging numpy arrays for raster inputs and output.

Implementation/Performance:

With np.vectorize:

>>> cProfile.run('bui_raster(np.array([996.9]), np.array([874.8]))')
         29 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 buildup_index.py:3(bui)
        1    0.000    0.000    0.000    0.000 function_base.py:2337(_call_as_normal)
        1    0.000    0.000    0.000    0.000 function_base.py:2367(__call__)
        1    0.000    0.000    0.000    0.000 function_base.py:2374(_get_ufunc_and_otypes)
        1    0.000    0.000    0.000    0.000 function_base.py:2404(<listcomp>)
        3    0.000    0.000    0.000    0.000 function_base.py:2405(<genexpr>)
        1    0.000    0.000    0.000    0.000 function_base.py:2409(<listcomp>)
        1    0.000    0.000    0.000    0.000 function_base.py:2433(<listcomp>)
        1    0.000    0.000    0.000    0.000 function_base.py:2443(_vectorize_call)
        1    0.000    0.000    0.000    0.000 function_base.py:2453(<listcomp>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.any}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        2    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        3    0.000    0.000    0.000    0.000 {built-in method numpy.asanyarray}
        3    0.000    0.000    0.000    0.000 {built-in method numpy.asarray}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.frompyfunc}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}

With the vanilla numpy implementation:

>>> cProfile.run('buildup_index(np.array([874.8]), np.array([996.9]))')
         10 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 buildup_index_raster.py:4(buildup_index)
        4    0.000    0.000    0.000    0.000 multiarray.py:346(where)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Structure:

Tooling:

Tests:

jordan-evens commented 3 months ago

Looks like both of your tests are showing a duration of 0 seconds? If you run enough iterations to show a measurable time, how big is the difference in execution speed?

I would really prefer not to have a type-specific implementation of each index if just applying the original function to a data structure isn't meaningfully slower, because that complicates things and defeats the purpose of this being a reference library. If you can show that there's a considerable performance impact in applying the original function vs. writing a specific function for a data type, then it might make sense to include a type-specific implementation.

Can you please run the profiling over enough iterations to see if there's a concrete difference in performance?

conbrad commented 3 months ago

Looks like both of your tests are showing a duration of 0 seconds? If you run enough iterations to show a measurable time, how big is the difference in execution speed?

I would really prefer not to have a type-specific implementation of each index if just applying the original function to a data structure isn't meaningfully slower, because that complicates things and defeats the purpose of this being a reference library. If you can show that there's a considerable performance impact in applying the original function vs. writing a specific function for a data type, then it might make sense to include a type-specific implementation.

Can you please run the profiling over enough iterations to see if there's a concrete difference in performance?

Yes, sorry for not making that clearer: the profile is of a scalar-value test from the R test suite, and it rips through it so fast that the measured time is reported as 0; however, the function call overhead is 3x that of the numpy implementation. Point taken though, I'll rerun the profile with one of the raster tests to see if that overhead slows things down significantly.

jordan-evens commented 3 months ago

Thanks. If you aren't running enough iterations to at least measure the difference in seconds then who knows what random thing might have incidentally run at the time and impacted performance. For all we know the garbage collector ran after the first test and that's why the second was slower.

conbrad commented 3 months ago

Thanks. If you aren't running enough iterations to at least measure the difference in seconds then who knows what random thing might have incidentally run at the time and impacted performance. For all we know the garbage collector ran after the first test and that's why the second was slower.

Yep, totally; I got distracted by seeing 3x the number of function calls. I'll get some real raster data running through multiple profile runs today.

conbrad commented 2 months ago

@jordan-evens commit https://github.com/cffdrs/cffdrs_py/pull/5/commits/4d8f239aad53b92269104038f84cac1c9789501e adds profiling code and output against real raster data, showing the difference in overhead between the numpy raster implementation and the vectorized version that wraps the current BUI function.

The vectorized version adds significant overhead (vec_2000m_profile_iter_x.txt files):

The numpy version in comparison (2000m_profile_iter_x.txt files):

We'll likely be running calculations on 100 m resolution .tif files, so this overhead would create noticeable end-to-end latency in our application.
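
For reference, a rough way to make that overhead measurable is to time both implementations over raster-sized random inputs rather than the scalar R test values. A minimal sketch, with the module paths taken from the profile listings above (they may sit elsewhere in the branch) and an argument order that is only an assumption here, since it is questioned later in the thread:

import timeit

import numpy as np

# Module paths as they appear in the cProfile output above; adjust to the branch layout.
from buildup_index import bui                    # scalar implementation
from buildup_index_raster import buildup_index   # numpy raster implementation

bui_vec = np.vectorize(bui)

# Raster-sized random inputs; the grid size and value ranges are placeholders.
dmc = np.random.rand(2000, 2000) * 300
dc = np.random.rand(2000, 2000) * 900

for name, fct in [("numpy", buildup_index), ("np.vectorize", bui_vec)]:
    t = timeit.timeit(lambda: fct(dmc, dc), number=10)
    print(f"{name}: {t:.2f}s for 10 iterations")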

jordan-evens commented 2 months ago

I'm thinking np.vectorize is crashing because you're trying to load the entire array into memory when it calls it, since I couldn't make an array with np.random.rand(10000, 10000) without it crashing.

Have you considered using numba instead? If you're just worried about execution speed it might be the better option.

I screwed around with some tests and, unless I've done something wrong, it seems to be faster than the custom numpy code:

import timeit
import numba as nb
import numpy as np
import numpy.typing as npt

# not sure why arguments are reversed so fix that
# def buildup_index(dc: npt.NDArray[np.float64], dmc: npt.NDArray[np.float64]):
def buildup_index(dmc: npt.NDArray[np.float64], dc: npt.NDArray[np.float64]):
    # Eq. 27a
    # Calculate the numerator and denominator
    numerator = 0.8 * dc * dmc
    denominator = dmc + 0.4 * dc 

    # Perform the division with np.divide and specify the condition with the where clause to avoid division by zero
    bui1 = np.divide(numerator, denominator, where=denominator != 0, out=np.zeros_like(numerator, dtype=float))

    # Eq. 27b - next 3 lines
    p = np.where(dmc == 0, 0, (dmc - bui1) / dmc)
    cc = 0.92 + ((0.0114 * dmc) ** 1.7)
    bui0 = dmc - cc * p

    # Constraints
    bui0 = np.where(bui0 < 0, 0, bui0)
    bui1 = np.where(bui1 < dmc, bui0, bui1)

    return bui1

def bui(dmc, dc):
    # was comparing version of this that checks bounds to numpy version that doesn't
    if dmc < 0:
        raise ValueError(f'Invalid dmc: {dmc}')
    if dc < 0:
        raise ValueError(f'Invalid dc: {dc}')
    # Eq. 27a
    bui1 = 0 if (dmc == 0 and dc == 0) else (0.8 * dc * dmc / (dmc + 0.4 * dc))
    # Eq. 27b - next 3 lines
    p = 0 if (dmc == 0) else ((dmc - bui1) / dmc)
    cc = 0.92 + ((0.0114 * dmc) ** 1.7)
    bui0 = dmc - cc * p
    # Constraints
    bui0 = 0 if (bui0 < 0) else bui0
    bui1 = bui0 if (bui1 < dmc) else bui1
    return bui1

def bui_nocheck(dmc, dc):
    # was comparing version of this that checks bounds to numpy version that doesn't
    # if dmc < 0:
    #     raise ValueError(f'Invalid dmc: {dmc}')
    # if dc < 0:
    #     raise ValueError(f'Invalid dc: {dc}')
    # Eq. 27a
    bui1 = 0 if (dmc == 0 and dc == 0) else (0.8 * dc * dmc / (dmc + 0.4 * dc))
    # Eq. 27b - next 3 lines
    p = 0 if (dmc == 0) else ((dmc - bui1) / dmc)
    cc = 0.92 + ((0.0114 * dmc) ** 1.7)
    bui0 = dmc - cc * p
    # Constraints
    bui0 = 0 if (bui0 < 0) else bui0
    bui1 = bui0 if (bui1 < dmc) else bui1
    return bui1

if __name__ == "__main__":
    DMC_MAX = 300
    DC_MAX = 900

    def do_time(fct, name, iterations=10):
        t = timeit.timeit(lambda: fct(dmc_array, dc_array), number=iterations)
        print(f"{iterations} iterations on {len(dmc_array)}x{len(dmc_array[0])} array for {name} takes {t}")
        return t

    bui_np = np.vectorize(bui)
    bui_nb = nb.vectorize(bui)

    bui_np_nc = np.vectorize(bui_nocheck)
    bui_nb_nc = nb.vectorize(bui_nocheck)

    for n in [100, 1000]:
        rows = n
        cols = n

        print(f"Creating input data of size {rows}x{cols}")
        dmc_array = np.random.rand(rows, cols) * DMC_MAX
        dc_array = np.random.rand(rows, cols) * DC_MAX

        def fct_list(dmc, dc):
            return np.array([np.array([bui(dmc[i][j], dc[i][j]) for j in range(len(dmc[i]))]) for i in range(len(dmc))])

        def make_apply(fct):
            # need to stack backwards
            return lambda dmc, dc: np.apply_along_axis(lambda x: fct(x[0], x[1]), 0, np.stack((dc, dmc)))

        do_time(buildup_index, "np")
        do_time(bui_np, "np.vectorize")
        do_time(bui_np_nc, "np.vectorize without bounds check")
        do_time(bui_nb, "nb.vectorize")
        do_time(bui_nb_nc, "nb.vectorize without bounds check")

    r0 = buildup_index(dmc_array, dc_array)
    r1 = bui_np(dmc_array, dc_array)
    r2 = bui_nb(dmc_array, dc_array)
    assert(np.equal(r0, r1).all())
    assert(np.equal(r0, r2).all())
    assert(np.equal(r1, r2).all())
    print("Validated that outputs are the same, so test performance now")

    for iterations in [10, 100, 1000]:
        print(f"Testing np vs np.vectorize vs nb.vectorize for {iterations} iterations")
        # now that we have the two best options, do some real comparison
        do_time(buildup_index, "np", iterations=iterations)
        do_time(bui_np, "np.vectorize", iterations=iterations)
        do_time(bui_nb, "nb.vectorize", iterations=iterations)

which gives me:

Creating input data of size 100x100
10 iterations on 100x100 array for np takes 0.0034444050033926032
10 iterations on 100x100 array for np.vectorize takes 0.05432502899930114
10 iterations on 100x100 array for np.vectorize without bounds check takes 0.05028547500114655
10 iterations on 100x100 array for nb.vectorize takes 3.817805860999215
10 iterations on 100x100 array for nb.vectorize without bounds check takes 0.15711500700126635
Creating input data of size 1000x1000
10 iterations on 1000x1000 array for np takes 0.35230661599780433
10 iterations on 1000x1000 array for np.vectorize takes 6.160964012000477
10 iterations on 1000x1000 array for np.vectorize without bounds check takes 5.73731357199722
10 iterations on 1000x1000 array for nb.vectorize takes 0.25733804699848406
10 iterations on 1000x1000 array for nb.vectorize without bounds check takes 0.1448044650023803
Validated that outputs are the same, so test performance now
Testing np vs np.vectorize vs nb.vectorize for 10 iterations
10 iterations on 1000x1000 array for np takes 0.3038933070056373
10 iterations on 1000x1000 array for np.vectorize takes 6.244716456996684
10 iterations on 1000x1000 array for nb.vectorize takes 0.2372354000035557
Testing np vs np.vectorize vs nb.vectorize for 100 iterations
100 iterations on 1000x1000 array for np takes 3.2701099400001112
100 iterations on 1000x1000 array for np.vectorize takes 62.02885445800348
100 iterations on 1000x1000 array for nb.vectorize takes 2.274711901001865
Testing np vs np.vectorize vs nb.vectorize for 1000 iterations
1000 iterations on 1000x1000 array for np takes 34.07073844700062
1000 iterations on 1000x1000 array for np.vectorize takes 613.1922253219964
1000 iterations on 1000x1000 array for nb.vectorize takes 22.724554789994727

The iterations at the end are using the original bui code, but you're comparing it to a version that doesn't check if the values are < 0, so a more direct comparison is:

10 iterations on 1000x1000 array for np takes 0.30092410200450104
10 iterations on 1000x1000 array for nb.vectorize without checks takes 0.14627075899625197
100 iterations on 1000x1000 array for np takes 3.39136324699939
100 iterations on 1000x1000 array for nb.vectorize without checks takes 1.4615159170061816
1000 iterations on 1000x1000 array for np takes 31.8209274779947
1000 iterations on 1000x1000 array for nb.vectorize without checks takes 15.389702129003126

I don't have your rasters or anything, so can you please try running your tests through with nb.vectorize and see how that performs?
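
One caveat on the 100x100 numbers above: the first nb.vectorize timing (3.8 s) most likely includes the one-time JIT compilation, since the lazy form of nb.vectorize compiles on the first call. If that matters for the comparison, here is a minimal sketch of eager compilation, which moves the warm-up out of the timed loop (bui_nocheck is the scalar function from the snippet above; using it rather than bui keeps the exception-raising bounds checks out of the example):

import numba as nb

# Passing explicit signatures makes nb.vectorize compile the ufunc up front,
# so the one-time JIT cost is paid here instead of inside the first timed call.
bui_nb_eager = nb.vectorize(["float64(float64, float64)"])(bui_nocheck)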

conbrad commented 2 months ago

Oh boy, I forgot about numba! Thanks for reminding me and giving me an excuse to try it out. Results incoming, stay tuned...

I'm thinking np.vectorize is crashing because you're trying to load the entire array into memory when it calls it, since I couldn't make an array with np.random.rand(10000, 10000) without it crashing.

I can run the numpy version to completion fine with the large raster, so I don't think that's it, but we're probably on different machines, which is likely why it fails for you. I imagine the failure I'm seeing specifically with the np.vectorize version is memory exhaustion due to the function call overhead, or I'm totally wrong and it's not related to memory at all; it was hard to find a definitive source for the 124 exit code.

not sure why arguments are reversed so fix that

Will fix that, thanks!

BadgerOnABike commented 2 months ago

My only other question would be: is it worth managing these as rasters through something like rasterio? Unless that breaks them down to their base arrays internally? My curiosity is whether we can keep the object as close to its original form as possible. This may also just be a hangover from how I have handled things within R.

I could be way out in left field picking dandelions too though.

conbrad commented 2 months ago

Note that the rasters I'm using are also committed in this branch, if you want to pull it and try it out. I haven't committed the 14GB ones though. 😅

My only other question would be: is it worth managing these as rasters through something like rasterio? Unless that breaks them down to their base arrays internally? My curiosity is whether we can keep the object as close to its original form as possible. This may also just be a hangover from how I have handled things within R.

Rasterio internally uses numpy arrays; it's numpy all the way down 😬 https://github.com/rasterio/rasterio/blob/24d0845e576158217f6541c3c81b163d873a994d/rasterio/_io.pyx#L966
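
As a minimal sketch of that point (the file names are hypothetical and the buildup_index import path is taken from the profile listings earlier in the thread): a rasterio band read is already a plain numpy array, so it can go straight into the raster function, and the result can be written back out with the source raster's profile.

import rasterio

from buildup_index_raster import buildup_index  # assumed module path

# Hypothetical file names; each band read returns a 2-D numpy array.
with rasterio.open("dmc.tif") as dmc_src, rasterio.open("dc.tif") as dc_src:
    dmc = dmc_src.read(1)
    dc = dc_src.read(1)
    profile = dmc_src.profile

bui = buildup_index(dmc, dc)  # argument order is an assumption, as noted above

# Write the result reusing the input raster's georeferencing.
profile.update(dtype="float64", count=1)
with rasterio.open("bui.tif", "w", **profile) as dst:
    dst.write(bui, 1)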

conbrad commented 2 months ago

@jordan-evens , @BadgerOnABike , apologies if you saw my earlier comment that I had to ninja-edit out: I messed up my profile test in commit https://github.com/cffdrs/cffdrs_py/pull/5/commits/a2b8d0e7533217130614c2c167e581c2c5b8730f, but the commit with the proper profile runs of numba shows it's smoking fast... https://github.com/cffdrs/cffdrs_py/pull/5/commits/441394f4f01dffd89899c23b30a434eedf13bbc8

Tests pass too.

Wow.

         14 function calls in 0.005 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005    0.005    0.005 dufunc.py:189(__call__)
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 pstats.py:107(__init__)
        1    0.000    0.000    0.000    0.000 pstats.py:117(init)
        1    0.000    0.000    0.000    0.000 pstats.py:136(load_stats)
        1    0.000    0.000    0.000    0.000 cProfile.py:50(create_stats)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        1    0.000    0.000    0.000    0.000 codecs.py:186(__init__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

conbrad commented 2 months ago

@jordan-evens , @BadgerOnABike , I've tested numba in our cluster and it runs well, so I've removed the vanilla numpy and the vectorized numpy raster implementations, and cleaned up some of the surrounding profiling code. I added build instructions for local dev setup in the README as well.

Let me know if you'd like any other changes to the PR.

BadgerOnABike commented 2 months ago

@jordan-evens and I have been chatting about intent and eligibility for ingestion a bit so we have a request.

The raster implementation of BUI is a great proof of concept for what raster work within CFFDRS_py would look like, and we're curious about the overall plan for how you intend to use this. From our conversation we came to the conclusion that providing a function that would rasterize all of the CFFDRS FWI indexes and provide them as an object would constitute a mergeable outcome for the project. This would be consistent with cffdrs_r, where there are no single-index raster implementations, since a person could wrap a base function with raster functionality relatively easily themselves. The greatest utility to the end user appears to be a single function that performs all FWI calculations and returns an object with all possible indexes. This also ensures that any rasterization of indexes is done consistently within the module.

Is that in step with the final planned outcome for this work @conbrad?

Additionally, do you know if there are any impacts of decorating the function with vectorize on the point initialization functions? I'd think not, but I've been surprised by strange behaviour in the past!
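
For illustration, one possible shape for that single entry point, as a minimal sketch: the fwi_rasters name, its argument list, and the module path are hypothetical; only the scalar bui comes from this thread, and the remaining indexes would be wrapped and added the same way.

import numba as nb
import numpy as np

from buildup_index import bui  # the library's scalar BUI (assumed module path)

bui_vec = nb.vectorize(bui)

def fwi_rasters(dmc: np.ndarray, dc: np.ndarray) -> dict:
    """Hypothetical single entry point: rasterize the FWI indexes and return
    them together as one object (here a dict of arrays). Only BUI is shown;
    the other indexes would be vectorized and added in the same way."""
    return {
        "bui": bui_vec(dmc, dc),
        # "isi": ..., "fwi": ..., and so on once their scalar functions are wrapped
    }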

conbrad commented 2 months ago

@jordan-evens and I have been chatting about intent and eligibility for ingestion a bit so we have a request.

The raster implementation of BUI is a great proof of concept for what raster work within CFFDRS_py would look like, and we're curious about the overall plan for how you intend to use this. From our conversation we came to the conclusion that providing a function that would rasterize all of the CFFDRS FWI indexes and provide them as an object would constitute a mergeable outcome for the project. This would be consistent with cffdrs_r, where there are no single-index raster implementations, since a person could wrap a base function with raster functionality relatively easily themselves. The greatest utility to the end user appears to be a single function that performs all FWI calculations and returns an object with all possible indexes. This also ensures that any rasterization of indexes is done consistently within the module.

Is that in step with the final planned outcome for this work @conbrad?

Additionally, do you know if there are any impacts of decorating the function with vectorize on the point initialization functions? I'd think not, but I've been surprised by strange behaviour in the past!

As discussed offline, we're able to vectorize the BUI function from our application code using numba, so we don't need to make any changes to the library. Thanks folks for your time and input!
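
For completeness, a minimal sketch of what that application-side wrapping might look like, with no change to the library itself (the import path is an assumption; nb.vectorize is applied exactly as in the benchmark earlier in the thread):

import numba as nb
import numpy as np

from buildup_index import bui  # the library's scalar BUI (assumed module path)

# Vectorize in application code; the library stays a plain scalar reference implementation.
bui_raster = nb.vectorize(bui)

dmc = np.random.rand(1000, 1000) * 300  # placeholder inputs
dc = np.random.rand(1000, 1000) * 900
result = bui_raster(dmc, dc)            # elementwise BUI over the whole grid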