EcohydrologyTeam / ClearWater-modules

A collection of water quality and vegetation process simulation modules designed to couple with water transport models.
MIT License

Improve timestep performance further: `numba` or not? #57

Closed xaviernogueira closed 5 months ago

xaviernogueira commented 11 months ago

@sjordan29 found TSM to be too slow: it accounted for nearly 80% of the run time, versus 20% for ClearWater-riverine, when running over a grid of flow fields from HEC-RAS-2D.

Ideas to test:

aufdenkampe commented 11 months ago

@sjordan29 did some performance profiling for a test case of ClearWater-modules TSM coupled with ClearWater-riverine transport. See https://github.com/EcohydrologyTeam/ClearWater-riverine/blob/43-integrate-tsm/examples/dev_sandbox/43_Realistic_TSM_Scenario.ipynb.

She shared her cProfile results (profile.prof), which I explored in SnakeViz. It did a great job of identifying which functions in our code were the bottlenecks. From what I see, the top 3 were

Below that, the time is spent primarily in xarray merge, copy, and init (create) functions such as:

Some of these take so much time because they are called so many times.

So, the key progress will be made by reducing our use of xarray functions that call these merge/copy/init functions.

@xaviernogueira, the profile that Sarah shared with us yesterday was https://github.com/pyutils/line_profiler, which can tell you what is happening line by line within a function. So it complements the cProfile/Snakeviz perspectives that you have used. Please do some line profiling of these functions:

I also want to go over the approach of taking slices of a pre-existing array for reading/writing values. This is a combination of your bottom two checkboxes above, which should be combined into a single solution. Let's discuss this.
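For reference, the cProfile side of this workflow can be reproduced with the standard library alone (SnakeViz just visualizes the resulting `.prof` stats); the function below is a hypothetical stand-in for one of our process functions, not actual TSM code:

```python
import cProfile
import pstats

def hypothetical_process_function(n: int) -> float:
    # Toy stand-in for a TSM process function (not from the actual codebase).
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
result = hypothetical_process_function(100_000)
profiler.disable()

# Sort by cumulative time, the same view SnakeViz opens with,
# and show the 10 most expensive entries.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```

line_profiler then drills into the expensive functions this view surfaces, line by line.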

xaviernogueira commented 11 months ago

Numba vs no-numba

Numba

I noticed that (as one might expect, given how JIT works) running one timestep is much slower, per timestep, than running many. This is because numba's Just-In-Time compilation only runs on the first pass through the loop (the full set of TSM calculations).

# speed of just 1 timestep
Increment timestep speed (average of 1): 6.521975040435791
Run time: 6.548974990844727

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.09739033937454224 (67x faster than running 1)
Run time: 9.782024145126343

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.030201313734054564 (215x faster than running 1, 3x faster than running 100)
Run time: 30.230324506759644

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.026183737969398498
Run time: 261.8664631843567

Notice that the majority of the run time is the first timestep, i.e. JIT compilation! The more timesteps are run, the smaller the percentage of total run time the initial compilation becomes.

No-Numba

Perhaps due to the number of process functions that get JIT compiled on the first loop, the compilation stage does significantly slow things down! Here are the same runs without numba (test_no_numba branch).

# speed of just 1 timestep
Increment timestep speed (average of 1): 0.6430933475494385
Run time: 0.6760678291320801

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.027670950889587403 (23x faster than running 1)
Run time: 2.797003746032715

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.022910058975219725 (28x faster than running 1, 1.2x faster than running 100)
Run time: 22.939059734344482

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.02401784040927887
Run time: 240.20740365982056

Note that the speed per timestep flattens out at around 0.024 seconds. Therefore many-timestep runs complete in roughly linear time, approximately (first_timestep_time + (total_timesteps - 1) * 0.024) seconds on my machine.
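That linear model can be written out directly (the 0.024 s per-step constant and the 0.643 s first step are the machine-specific numbers from the runs above):

```python
def estimate_runtime(first_timestep_s: float, per_step_s: float, n_steps: int) -> float:
    """Linear run-time model: one expensive first step, then a constant cost per step."""
    return first_timestep_s + (n_steps - 1) * per_step_s

# No-numba numbers from the runs above.
est = estimate_runtime(0.643, 0.024, 10_000)
print(f"estimated: {est:.1f}s")  # close to the measured 240.2 s
```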

Discussion

PeterKlaver commented 11 months ago

Typical models have 10,000s to 100,000s of time steps.

xaviernogueira commented 11 months ago

@PeterKlaver got it, just added a 10k timestep test

PeterKlaver commented 11 months ago

It's not unusual to do water quality simulations over a full recreation season, say, to evaluate compliance with recreational use criteria. That's seven months - if the model time step is 30 seconds, the total N is ~ 600,000. If the time step is 10 seconds, N ~ 1,800,000. You can do the math. But if you see linear scaling up to around 10K it is probably okay to extrapolate from there.
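Doing that math explicitly (treating seven months as roughly 210 days, an assumption for the back-of-envelope count):

```python
SECONDS_PER_DAY = 86_400
season_days = 7 * 30  # ~seven months

for dt in (30, 10):  # candidate model time steps, in seconds
    n_steps = season_days * SECONDS_PER_DAY // dt
    print(f"dt={dt}s -> {n_steps:,} timesteps")
```

This gives ~605,000 steps at 30 s and ~1.8 million at 10 s, matching the estimates above.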

xaviernogueira commented 11 months ago

Strategy of writing to a pre-initialized set of xarray time slices

Performance Comparison

Test 1:

I implemented the current algorithm in clearwater_modules: copy the last timestep (producing a Dataset with len(time) == 1), update its values, then concatenate it onto the end of our existing Dataset. (Timing screenshot attached.)

Time: 364ms
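A minimal sketch of that copy-and-concatenate pattern (dimension names, sizes, and the `+ 1.0` update are illustrative stand-ins):

```python
import numpy as np
import xarray as xr

# Start with a single-timestep Dataset, like the model state after init.
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.zeros((1, 25, 53)))},
    coords={"time": [0]},
)

# Each timestep: copy the last slice, update it, concat it back on.
for t in range(1, 5):
    step = ds.isel(time=[-1]).copy()        # Dataset with len(time) == 1
    step["temp"] = step["temp"] + 1.0       # stand-in for the TSM update
    step = step.assign_coords(time=[t])
    ds = xr.concat([ds, step], dim="time")  # copies the whole array every step
```

The repeated `xr.concat` is the expensive part: every step re-copies the entire history.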

Test 2:

Here we make another (lat: 25, lon: 53, time: 50) dataset, where time ranges from one past the last time of the existing dataset to 50 past it. All values are initialized as np.NaN. Then we concat it onto our existing dataset once, and iterate over the new indices, writing to each time slice.

Creating the new Dataset and concatenating it takes about 9ms, a constant one-time cost since it only runs once. (The per-slice write timing is in the attached screenshot.)
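A sketch of this preallocate-then-write strategy (same illustrative dimensions and toy update as before):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.zeros((1, 25, 53)))},
    coords={"time": [0]},
)

# One-time cost: pre-allocate 50 NaN-filled timesteps and concat once.
n_new = 50
empty = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.full((n_new, 25, 53), np.nan))},
    coords={"time": np.arange(1, n_new + 1)},
)
ds = xr.concat([ds, empty], dim="time")

# Per-timestep cost: write into the pre-allocated slice in place,
# with no further copying or concatenation.
for i in range(1, n_new + 1):
    ds["temp"][{"time": i}] = ds["temp"][{"time": i - 1}].values + 1.0
```

The in-place slice assignment avoids the per-step merge/copy/init calls that the cProfile results flagged.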

Considerations:

If the desired behavior is well constrained, I am happy to implement it in one shot, potentially outside my office time. This can be discussed in our meeting this Friday (12/15/2023).

aufdenkampe commented 9 months ago

Spinning off https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494 into a new issue:

aufdenkampe commented 9 months ago

Based on @xaviernogueira's Numba vs no-numba test findings described in https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1856667822, he found that:

This surprising result is probably due to the built-in performance of using "vectorized" calculations enabled by numpy, xarray, and dask-array.

The difference is negligible after about 10,000 time steps, so this doesn't deserve a huge effort to refactor away from numba.

That said, this does provide evidence that we should slowly start moving away from numba, given that it also substantially increases code complexity versus vectorized array calculations.

NOTE: There is still a case for using numba when a "vectorized" approach isn't available, but fortunately those are likely to be rare due to the extensive development of numpy and xarray.
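As an illustration of that principle, here is the same toy update written both ways (the relaxation formula is invented for the example, not taken from TSM):

```python
import numpy as np

def update_loop(temp, k=0.1):
    # Explicit element loop: the style that needs numba to be fast.
    out = np.empty_like(temp)
    for i in range(temp.size):
        out.flat[i] = temp.flat[i] + k * (20.0 - temp.flat[i])
    return out

def update_vectorized(temp, k=0.1):
    # Same update as a single array expression: fast without numba,
    # and considerably less code to maintain.
    return temp + k * (20.0 - temp)

temp = np.random.default_rng(0).uniform(5, 30, size=(25, 53))
assert np.allclose(update_loop(temp), update_vectorized(temp))
```

When an update genuinely cannot be expressed as array operations (e.g. data-dependent iteration per cell), that is where numba still earns its keep.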

aufdenkampe commented 5 months ago

Closing this issue, which was about exploring approaches to improving performance. With the results described above, we created the following issues for tracking fixes: