The same is true for calc_density_h2o, and I need to implement a new method for that for compatibility with the SPLASH benchmarks. So I've just pushed a commit to feature/splash_v1 that adds various alternate approaches to calculating that new method (Chen et al), which has been revealing. With that commit (https://github.com/ImperialCollegeLondon/pyrealm/pull/69/commits/14faef79b1329333e8ef37866e1e9d243b1a98b8), I've timed the various approaches:
import timeit
import numpy as np
import matplotlib.pyplot as plt

func_alts = {"simple", "chen_allinone", "chen_cumulative", "chen_matrix"}
sizes = np.array([1, 5, 10, 25, 50, 100])
run_times = {k: np.full(6, fill_value=np.nan) for k in func_alts}

for fn in func_alts:
    call = f"density_h2o_{fn}(tc, patm)"
    for idx, sz in enumerate(sizes):
        setup = (f"from pyrealm.pmodel.functions import density_h2o_{fn}\n"
                 f"import numpy as np\n"
                 f"tc = np.random.uniform(0, 50, ({sz},{sz}))\n"
                 f"patm = np.random.uniform(90000, 120000, ({sz},{sz}))\n")
        run_times[fn][idx] = timeit.timeit(call, setup=setup, number=200)

for fn in func_alts:
    print(fn)
    plt.plot(sizes, run_times[fn], label=fn)

plt.legend()
plt.show()
The matrix approach using np.pow and np.sum basically sucks - it is by far the slowest and scales the worst with size. Interestingly, the original implementation (density_h2o_cumulative), which accumulates the polynomial terms with +=, seems best, even faster than expressing the whole term as one allinone equation to avoid repeated assignment. Obviously the simple calculation is the fastest, but that may be woefully inaccurate. Annoyingly, the original implementation of calc_viscosity_h2o also used a similar cumulative approach and I swapped it out for the ever so clever matrix calculation. I am both surprised and chastened - anyone know why this is the case?
It looks to me like a lot of memory is being allocated and thrown away and reallocated again in the big np commands. I'll try and have a deeper look tomorrow but I think this also explains the memory allocation issue we're seeing with the big data set in the CI. If I cut the dataset down to a smaller value we can get further but we hit that limit again later on in that method. This is perhaps not unexpected but we should see what we can do in this area.
Apologies, this was a quick note to capture some thoughts before I forget them.
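To make the allocation point above concrete, here is a minimal sketch of the two styles of polynomial evaluation being compared. The coefficients and function names are made up for illustration and are not the pyrealm implementations:

import numpy as np

# Made-up polynomial coefficients, purely for illustration
coefs = np.array([1.0, 2.5e-2, -7.0e-6, 3.0e-9])

def poly_matrix(tc):
    """Matrix-style evaluation: build an array of powers with an extra trailing
    axis, multiply by the coefficients and then sum that axis away. The powers
    array is a large temporary that is allocated, filled and immediately discarded."""
    powers = np.power.outer(tc, np.arange(len(coefs)))  # shape (*tc.shape, 4)
    return np.sum(coefs * powers, axis=-1)

def poly_cumulative(tc):
    """Cumulative evaluation: accumulate term by term with +=. Only arrays of
    the same shape as tc are ever created."""
    res = np.full_like(tc, coefs[0])
    tc_pow = np.ones_like(tc)
    for c in coefs[1:]:
        tc_pow *= tc
        res += c * tc_pow
    return res

tc = np.random.uniform(0, 50, (100, 100))
assert np.allclose(poly_matrix(tc), poly_cumulative(tc))

Both give the same answer, but the matrix route materialises an array one dimension bigger than the inputs on every call, which lines up with the allocate/discard/reallocate pattern described above.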
Yeah - I haven't even looked at the memory allocation here. Is there a simple timeit-style option for doing that? We already have what looks like a compelling argument for moving back to the simple cumulative approach!
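As an aside (not the route taken later in this thread), one near-timeit-style option from the standard library is tracemalloc, which NumPy hooks into for array data allocations, so it can give a rough peak-memory figure around a single call. The import path and call signature below assume the existing calc_density_h2o(tc, patm):

import tracemalloc
import numpy as np
from pyrealm.pmodel.functions import calc_density_h2o  # assumed signature: (tc, patm)

tc = np.random.uniform(0, 50, (1000, 1000))
patm = np.random.uniform(90000, 120000, (1000, 1000))

tracemalloc.start()
_ = calc_density_h2o(tc, patm)
current, peak = tracemalloc.get_traced_memory()  # values in bytes
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")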
In the short term, @a-smith-github, I'd forgotten that we can already switch to the simple calculation. It's a bit obscure, but the const argument to PModelEnvironment provides a whole bunch of constants/parameters/settings, and we can adjust the test_profiling.py script to:
from pyrealm.constants import PModelConst
const = PModelConst(simple_viscosity=True)
pm_env = PModelEnvironment(tc=tc, patm=patm, vpd=vpd, co2=co2, const=const)
All objects created from that pm_env object should then inherit the setting to use simple viscosity.
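For example (assuming the usual pattern of building a model from the environment), something like:

from pyrealm.pmodel import PModel

# The model built from pm_env picks up the const settings via the environment
model = PModel(pm_env)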
OK - I have started a new branch from this issue to update the functions in develop. With commit a717e14, I have implemented separate functions for two methods (chen and fisher), using both matrix algebra and straight multiplication to get powers of tc. Running the comparisons shows that straight multiplication is miles better in both cases, so I will update develop to add the new chen method and update the calculations.

I can't find an acceptable simple calculation for density at the moment, so let's see how this update affects the profiling.
The comparison from that branch shows:
import timeit
import numpy as np
import matplotlib.pyplot as plt

func_alts = {"chen", "chen_matrix", "fisher", "fisher_matrix"}
sizes = np.array([1, 5, 10, 25, 50, 100])
run_times = {k: np.full(6, fill_value=np.nan) for k in func_alts}

for fn in func_alts:
    call = f"calc_density_h2o_{fn}(tc, patm)"
    for idx, sz in enumerate(sizes):
        setup = (f"from pyrealm.pmodel.functions import calc_density_h2o_{fn}\n"
                 f"import numpy as np\n"
                 f"tc = np.random.uniform(0, 50, ({sz},{sz}))\n"
                 f"patm = np.random.uniform(90000, 120000, ({sz},{sz}))\n")
        run_times[fn][idx] = timeit.timeit(call, setup=setup, number=200)

for fn in func_alts:
    plt.plot(sizes, run_times[fn], label=fn)

plt.tight_layout()
plt.legend()
plt.show()
I have just pushed another commit (0283d75e070fc125c7413a41c988834c5de3e860) to the branch that adds a new version of calc_viscosity_h2o that avoids using matrix multiplication. This also makes a big difference to run time, although not quite as marked as the density example:
import timeit
import numpy as np
import matplotlib.pyplot as plt

func_alts = {"h2o", "h2o_matrix"}
sizes = np.array([1, 5, 10, 25, 50, 100])
run_times = {k: np.full(6, fill_value=np.nan) for k in func_alts}

for fn in func_alts:
    call = f"calc_viscosity_{fn}(tc, patm)"
    for idx, sz in enumerate(sizes):
        setup = (f"from pyrealm.pmodel.functions import calc_viscosity_{fn}\n"
                 f"import numpy as np\n"
                 f"tc = np.random.uniform(0, 50, ({sz},{sz}))\n"
                 f"patm = np.random.uniform(90000, 120000, ({sz},{sz}))\n")
        run_times[fn][idx] = timeit.timeit(call, setup=setup, number=200)

for fn in func_alts:
    plt.plot(sizes, run_times[fn], label=fn)

plt.tight_layout()
plt.legend()
plt.show()
I will again update the branch to remove the slow matrix implementation.
Some memory profiling using the Python memory-profiler package. I also get:

pyrealm/subdaily.py:305: RuntimeWarning: divide by zero encountered in divide
pyrealm/subdaily.py:305: RuntimeWarning: invalid value encountered in divide
Thanks - will have to track down what inputs are causing those warnings. What is the code used to invoke the memory profiling? It would be good to see how the matrix and non-matrix methods compare.
Just import the library and then put the decorator @profile before the function. For the plots, use, for example, mprof run python3 tests/pmodel/test_profiling.py (I just used that and made it a script by adding test_profiling_example() to the bottom), and then mprof plot.

See also https://pypi.org/project/memory-profiler/, where it is all described. I just could not really get any additional info using --include-children so far.
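As a concrete sketch of that workflow (the script and function below are placeholders, not files in the repo):

# profile_density.py - hypothetical stand-alone script for mprof
import numpy as np
from memory_profiler import profile

from pyrealm.pmodel.functions import calc_density_h2o_fisher

@profile
def run_density():
    """Allocate the test inputs and call the function under test."""
    tc = np.random.uniform(0, 50, (2000, 2000))
    patm = np.random.uniform(90000, 120000, (2000, 2000))
    calc_density_h2o_fisher(tc, patm)

if __name__ == "__main__":
    run_density()

Running mprof run python3 profile_density.py and then mprof plot produces the memory-over-time plot for the decorated function.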
I am now running the profiling for the methods calc_viscosity_h2o, calc_viscosity_h2o_matrix, calc_density_h2o_fisher and calc_density_h2o_chen separately.

Note that I needed to increase the size of tc and patm significantly to get meaningful results. I used

dim = 2000
tc = np.random.uniform(0, 50, (dim, dim))
patm = np.random.uniform(90000, 120000, (dim, dim))

(When I ran it with dim=10000, it ran for all methods except calc_viscosity_h2o_matrix, for which it crashed.)
These are the profiles:

[memory profile plots for calc_viscosity_h2o, calc_viscosity_h2o_matrix, calc_density_h2o_fisher and calc_density_h2o_chen]
So the matrix method is both slower and uses considerably more memory?
Significantly more, yes.
Makes sense that it's the np.outer on line 925 that gets flagged in profiling - great that the allocation and deallocation you can see on htop is basically attributable to one line.

Am I right in thinking that rbar is a constant float? I wonder if we could re-formulate this section in some way so we can avoid creating such a large data structure. Basically, can we avoid doing an np.outer and get the result of np.sum(mu1, axis=2) some other way (i.e. can we drop a dimension?) so we can do the division on line 926? I imagine the np.outer on L926 might need to stay but might get some speedup. Been a while since I've done any work with matrices...
The new version replaces the outer calls completely with simple sequential sums. I suspect that there is a more efficient formulation - let's see how this new implementation affects the runtime breakdown.
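For anyone following along, the general shape of that change is something like this (illustrative only - the coefficient matrix and variable names are stand-ins, not the actual viscosity constants or the pyrealm code):

import numpy as np

# Stand-in coefficient matrix; the real values come from the viscosity parameterisation
h_array = np.random.rand(6, 7)

def double_sum_outer(tbar, rho):
    """Outer-product version: builds temporaries with two extra trailing axes."""
    i_pow = np.power.outer(1.0 / tbar, np.arange(h_array.shape[0]))  # (*shape, 6)
    j_pow = np.power.outer(rho - 1.0, np.arange(h_array.shape[1]))   # (*shape, 7)
    terms = i_pow[..., :, None] * h_array * j_pow[..., None, :]      # (*shape, 6, 7) temporary
    return np.sum(terms, axis=(-2, -1))

def double_sum_sequential(tbar, rho):
    """Sequential version: accumulate with += and *=, using only input-shaped arrays."""
    total = np.zeros_like(tbar)
    ctbar_pow = np.ones_like(tbar)          # (1 / tbar) ** i, built up incrementally
    for i in range(h_array.shape[0]):
        row = np.zeros_like(tbar)
        rho_pow = np.ones_like(tbar)        # (rho - 1) ** j, built up incrementally
        for j in range(h_array.shape[1]):
            row += h_array[i, j] * rho_pow
            rho_pow *= rho - 1.0
        total += ctbar_pow * row
        ctbar_pow *= 1.0 / tbar
    return total

tbar = np.random.uniform(0.9, 1.1, (500, 500))
rho = np.random.uniform(0.9, 1.1, (500, 500))
assert np.allclose(double_sum_outer(tbar, rho), double_sum_sequential(tbar, rho))

Dropping the two extra axes means the sequential version never allocates anything larger than the input grid, at the cost of a short Python loop over the small, fixed coefficient dimensions.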
In the profiling (#81) it was found that one bottleneck seems to be the function https://github.com/ImperialCollegeLondon/pyrealm/blob/fcbba0eead6857fd412522ff7299af03c4d31b95/pyrealm/pmodel/functions.py#L671.
@davidorme explains that there are various, sometimes very simple alternatives for this calculation.
It would be useful to implement the option to switch between different variants. These could be profiled separately, and the user can then be given the option to trade speed against precision.
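One possible shape for that switch (names here are illustrative, not a final API decision) is a thin dispatcher over the per-method functions that already exist on the branch, so the choice can be exposed as a single argument:

import numpy as np

from pyrealm.pmodel.functions import calc_density_h2o_chen, calc_density_h2o_fisher

def calc_density_h2o(tc, patm, method="fisher"):
    """Dispatch to one of the water density variants by name (sketch only)."""
    methods = {"fisher": calc_density_h2o_fisher, "chen": calc_density_h2o_chen}
    if method not in methods:
        raise ValueError(f"Unknown density method: {method!r}")
    return methods[method](tc, patm)

That keeps the individual implementations easy to profile separately while letting users trade speed against precision with one setting.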