EcohydrologyTeam / ClearWater-modules

A collection of water quality and vegetation process simulation modules designed to couple with water transport models.
MIT License

Improve timestep performance further: `numba` or not? #57

Closed xaviernogueira closed 5 months ago

xaviernogueira commented 11 months ago

@sjordan29 found TSM to be too slow: it accounted for nearly 80% of the run time, versus 20% for ClearWater-riverine, when running over a grid of flow fields from HEC-RAS-2D.

Ideas to test:

aufdenkampe commented 11 months ago

@sjordan29 did some performance profiling for a test case of ClearWater-modules TSM coupled with ClearWater-riverine transport. See https://github.com/EcohydrologyTeam/ClearWater-riverine/blob/43-integrate-tsm/examples/dev_sandbox/43_Realistic_TSM_Scenario.ipynb.

She shared her cProfile results (profile.prof), which I explored in SnakeViz. It did a great job of identifying which functions in our code were the bottlenecks. From what I see, the top 3 were

Below that, the time is spent primarily in xarray merge, copy, and init (create) functions such as:

Some of these take so much time because they are called so many times.

So, the key progress will be made by reducing our use of xarray functions that call these merge/copy/init functions.

@xaviernogueira, the profile that Sarah shared with us yesterday was https://github.com/pyutils/line_profiler, which can tell you what is happening line by line within a function. So it complements the cProfile/Snakeviz perspectives that you have used. Please do some line profiling of these functions:

I also want to go over the approach of taking slices of a pre-existing array for reading/writing values. This is a combination of your bottom two checkboxes above, which should be combined into a single solution. Let's discuss this.
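For reference, the cProfile side of this workflow can be reproduced with the standard library alone (SnakeViz just visualizes the resulting `.prof` stats); the function below is a hypothetical stand-in for one of our process functions, not actual TSM code:

```python
import cProfile
import pstats

def hypothetical_process_function(n: int) -> float:
    # Toy stand-in for a TSM process function (not from the actual codebase).
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
result = hypothetical_process_function(100_000)
profiler.disable()

# Sort by cumulative time, the same view SnakeViz opens with,
# and show the 10 most expensive entries.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```

line_profiler then drills into the expensive functions this view surfaces, line by line.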

xaviernogueira commented 11 months ago

Numba vs no-numba

Numba

I noticed that (as one might expect, given how JIT works) running one timestep is much slower, per timestep, than running many. This is because numba's Just-In-Time compilation only runs on the first pass through the loop (the full set of TSM calculations).

# speed of just 1 timestep
Increment timestep speed (average of 1): 6.521975040435791
Run time: 6.548974990844727

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.09739033937454224 (67x faster than running 1)
Run time: 9.782024145126343

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.030201313734054564 (215x faster than running 1, 3x faster than running 100)
Run time: 30.230324506759644

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.026183737969398498
Run time: 261.8664631843567

Notice that the majority of the run time is the first timestep, i.e. JIT compilation! The more timesteps are run, the smaller the percentage of total run time the initial compilation becomes.

No-Numba

Perhaps due to the number of process functions that get JIT compiled on the first loop, the compilation stage does significantly slow things down! Here are the same runs without numba (test_no_numba branch).

# speed of just 1 timestep
Increment timestep speed (average of 1): 0.6430933475494385
Run time: 0.6760678291320801

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.027670950889587403 (23x faster than running 1)
Run time: 2.797003746032715

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.022910058975219725 (28x faster than running 1, 1.2x faster than running 100)
Run time: 22.939059734344482

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.02401784040927887
Run time: 240.20740365982056

Note that the speed per timestep flattens out at around 0.024 seconds. Therefore many-timestep runs complete in roughly linear time, approximately (first_timestep_time + (total_timesteps - 1) * 0.024) seconds on my machine.
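That linear model can be written out directly (the 0.024 s per-step constant and the 0.643 s first step are the machine-specific numbers from the runs above):

```python
def estimate_runtime(first_timestep_s: float, per_step_s: float, n_steps: int) -> float:
    """Linear run-time model: one expensive first step, then a constant cost per step."""
    return first_timestep_s + (n_steps - 1) * per_step_s

# No-numba numbers from the runs above.
est = estimate_runtime(0.643, 0.024, 10_000)
print(f"estimated: {est:.1f}s")  # close to the measured 240.2 s
```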

Discussion

PeterKlaver commented 11 months ago

Typical models have 10,000s to 100,000s of time steps.

xaviernogueira commented 11 months ago

@PeterKlaver got it, just added a 10k timestep test

PeterKlaver commented 11 months ago

It's not unusual to do water quality simulations over a full recreation season, say, to evaluate compliance with recreational use criteria. That's seven months - if the model time step is 30 seconds, the total N is ~ 600,000. If the time step is 10 seconds, N ~ 1,800,000. You can do the math. But if you see linear scaling up to around 10K it is probably okay to extrapolate from there.
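Doing that math explicitly (treating seven months as roughly 210 days, an assumption for the back-of-envelope count):

```python
SECONDS_PER_DAY = 86_400
season_days = 7 * 30  # ~seven months

for dt in (30, 10):  # candidate model time steps, in seconds
    n_steps = season_days * SECONDS_PER_DAY // dt
    print(f"dt={dt}s -> {n_steps:,} timesteps")
```

This gives ~605,000 steps at 30 s and ~1.8 million at 10 s, matching the estimates above.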

xaviernogueira commented 11 months ago

Strategy of writing to a pre-initialized set of xarray time slices

Performance Comparison

Test 1:

I implemented the current algorithm in clearwater_modules: copy the last timestep (producing a Dataset with len(time) == 1), update its values, then concatenate it onto the end of our existing Dataset. (Timing screenshot attached.)

Time: 364ms
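A minimal sketch of that copy-and-concatenate pattern (dimension names, sizes, and the `+ 1.0` update are illustrative stand-ins):

```python
import numpy as np
import xarray as xr

# Start with a single-timestep Dataset, like the model state after init.
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.zeros((1, 25, 53)))},
    coords={"time": [0]},
)

# Each timestep: copy the last slice, update it, concat it back on.
for t in range(1, 5):
    step = ds.isel(time=[-1]).copy()        # Dataset with len(time) == 1
    step["temp"] = step["temp"] + 1.0       # stand-in for the TSM update
    step = step.assign_coords(time=[t])
    ds = xr.concat([ds, step], dim="time")  # copies the whole array every step
```

The repeated `xr.concat` is the expensive part: every step re-copies the entire history.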

Test 2:

Here we make another (lat: 25, lon: 53, time: 50) dataset, where time ranges from one past the last time of the existing dataset to 50 past it. All values are initialized as np.NaN. Then we concat it onto our existing dataset once, and iterate over the new indices, writing to each time slice.

Creating the new Dataset and concatenating it takes about 9ms, a constant one-time cost since it only runs once. (The per-slice write timing is in the attached screenshot.)
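A sketch of this preallocate-then-write strategy (same illustrative dimensions and toy update as before):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.zeros((1, 25, 53)))},
    coords={"time": [0]},
)

# One-time cost: pre-allocate 50 NaN-filled timesteps and concat once.
n_new = 50
empty = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.full((n_new, 25, 53), np.nan))},
    coords={"time": np.arange(1, n_new + 1)},
)
ds = xr.concat([ds, empty], dim="time")

# Per-timestep cost: write into the pre-allocated slice in place,
# with no further copying or concatenation.
for i in range(1, n_new + 1):
    ds["temp"][{"time": i}] = ds["temp"][{"time": i - 1}].values + 1.0
```

The in-place slice assignment avoids the per-step merge/copy/init calls that the cProfile results flagged.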

Considerations:

If the desired behavior is well constrained, I am happy to implement it in one shot, potentially outside my office time. This can be discussed in our meeting this Friday (12/15/2023).

aufdenkampe commented 9 months ago

Spinning off https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494 into a new issue:

aufdenkampe commented 9 months ago

Based on @xaviernogueira's Numba vs no-numba test findings described in https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1856667822, he found that:

This surprising result is probably due to the built-in performance of using "vectorized" calculations enabled by numpy, xarray, and dask-array.

The difference is negligible after about 10,000 time steps, so this doesn't deserve a huge effort to refactor away from numba.

That said, this does provide evidence that we should slowly start moving away from numba, given that it also substantially increases code complexity versus vectorized array calculations.

NOTE: There is still a case for using numba when a "vectorized" approach isn't available, but fortunately those are likely to be rare due to the extensive development of numpy and xarray.
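As an illustration of that principle, here is the same toy update written both ways (the relaxation formula is invented for the example, not taken from TSM):

```python
import numpy as np

def update_loop(temp, k=0.1):
    # Explicit element loop: the style that needs numba to be fast.
    out = np.empty_like(temp)
    for i in range(temp.size):
        out.flat[i] = temp.flat[i] + k * (20.0 - temp.flat[i])
    return out

def update_vectorized(temp, k=0.1):
    # Same update as a single array expression: fast without numba,
    # and considerably less code to maintain.
    return temp + k * (20.0 - temp)

temp = np.random.default_rng(0).uniform(5, 30, size=(25, 53))
assert np.allclose(update_loop(temp), update_vectorized(temp))
```

When an update genuinely cannot be expressed as array operations (e.g. data-dependent iteration per cell), that is where numba still earns its keep.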

aufdenkampe commented 5 months ago

Closing this issue, which was about exploring approaches to improving performance. With the results described above, we created the following issues for tracking fixes: