Closed: xaviernogueira closed this issue 5 months ago
@sjordan29 did some performance profiling for a test case of ClearWater-modules TSM coupled with ClearWater-riverine transport. See https://github.com/EcohydrologyTeam/ClearWater-riverine/blob/43-integrate-tsm/examples/dev_sandbox/43_Realistic_TSM_Scenario.ipynb.
She shared her cProfile results (profile.prof), which I explored in SnakeViz. It did a great job of identifying which functions in our code were the bottlenecks. From what I see, the top 3 were:
Below that, the time is spent primarily in xarray merge, copy, and init (create) functions such as:
Some of these take so much time simply because they are called many times.
So, the key progress will be made by reducing our use of xarray functions that call these merge/copy/init functions.
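For reference, a profile like the `profile.prof` file Sarah shared can be collected with the standard-library `cProfile`/`pstats` modules and then opened in SnakeViz. The workload below is a placeholder, not ClearWater code:

```python
import cProfile
import io
import pstats

def hot_function(n=200):
    # Placeholder workload standing in for a model timestep.
    total = 0.0
    for _ in range(n):
        total += sum(j * j for j in range(n))
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# profiler.dump_stats("profile.prof")  # then: snakeviz profile.prof
stats = io.StringIO()
pstats.Stats(profiler, stream=stats).sort_stats("cumulative").print_stats(5)
```

Sorting by cumulative time is what makes the merge/copy/init call chains show up near the top, the same view SnakeViz gives graphically.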
@xaviernogueira, the profiler that Sarah shared with us yesterday was https://github.com/pyutils/line_profiler, which can tell you what is happening line by line within a function. So it complements the cProfile/SnakeViz perspectives that you have used. Please do some line profiling of these functions:
I also want to go over the approach of taking slices of a pre-existing array for reading/writing values. This is a combination of your bottom two checkboxes above, which should be combined into a single solution. Let's discuss this.
I noticed that (as one might expect with an understanding of JIT) running one timestep is much slower (per timestep) than running many. This is because the Just-In-Time compilation done by numba only runs the first time through the loop (the first full TSM calculation).
```
# speed of just 1 timestep
Increment timestep speed (average of 1): 6.521975040435791
Run time: 6.548974990844727

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.09739033937454224 (67x faster than running 1)
Run time: 9.782024145126343

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.030201313734054564 (215x faster than running 1, 3x faster than running 100)
Run time: 30.230324506759644

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.026183737969398498
Run time: 261.8664631843567
```
Notice how the majority of the run time is the first timestep, i.e., JIT compilation! The more timesteps that are run, the smaller a percentage of total run time the initial compilation becomes.
Perhaps due to the number of process functions that get JIT compiled in the first loop, the compilation stage slows things down significantly!
Here are the same runs without numba (`test_no_numba` branch).
```
# speed of just 1 timestep
Increment timestep speed (average of 1): 0.6430933475494385
Run time: 0.6760678291320801

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.027670950889587403 (23x faster than running 1)
Run time: 2.797003746032715

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.022910058975219725 (28x faster than running 1, 1.2x faster than running 100)
Run time: 22.939059734344482

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.02401784040927887
Run time: 240.20740365982056
```
Note that the speed per timestep flattens out at around 0.024 seconds. Therefore runs with many timesteps complete in roughly linear time, approx (first_timestep_time + (total_timesteps - 1) * 0.024) on my machine.
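That amortization formula can be captured in a two-line helper (a sketch; the 0.643 s and 0.024 s figures are the no-numba measurements above):

```python
def projected_runtime(first_step_s, steady_step_s, n_steps):
    # First (compile- or setup-heavy) step, plus the remaining steps
    # at the flattened-out per-step speed.
    return first_step_s + (n_steps - 1) * steady_step_s

# No-numba numbers from above: first step ~0.643 s, steady state ~0.024 s.
estimate = projected_runtime(0.643, 0.024, 10_000)
# estimate ~ 240.6 s, close to the 240.2 s measured for 10,000 timesteps
```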
There is now a `use_numba: bool` option to toggle this.
Typical models have 10,000s to 100,000s of time steps.
@PeterKlaver got it, just added a 10k timestep test
It's not unusual to do water quality simulations over a full recreation season, say, to evaluate compliance with recreational use criteria. That's seven months - if the model time step is 30 seconds, the total N is ~ 600,000. If the time step is 10 seconds, N ~ 1,800,000. You can do the math. But if you see linear scaling up to around 10K it is probably okay to extrapolate from there.
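Doing that math explicitly (assuming roughly 214 days for a seven-month season):

```python
def n_timesteps(days, dt_seconds):
    # Total number of model steps for a run of the given length.
    return days * 24 * 3600 // dt_seconds

assert n_timesteps(214, 30) == 616_320    # ~600,000 steps at a 30 s step
assert n_timesteps(214, 10) == 1_848_960  # ~1.8 million steps at a 10 s step
```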
**Pre-allocating `xarray` time slices**

I tested `xarray` workflows in the `compute_engine_testing.ipynb` notebook to gauge if the approach is worth developing. I use the `air_temperature` tutorial Dataset from `xarray`, which is a (lat: 25, lon: 53, time: 50) grid. `water_temp_c` (renamed air temperature) is updated to +10 more than the last timestep.

First, I implement the current algorithm that is in `clearwater_modules`: copy the last timestep (making a Dataset with time length 1), update its values, then concatenate it onto the end of our existing Dataset.
Time: 364ms
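The copy-then-concat pattern described above can be sketched on a synthetic grid (the variable name and shape follow the description; this is not the notebook's actual code):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the (lat: 25, lon: 53, time: 50) grid.
ds = xr.Dataset(
    {"water_temp_c": (("time", "lat", "lon"), np.zeros((1, 25, 53)))},
    coords={"time": [0], "lat": np.arange(25), "lon": np.arange(53)},
)

for step in range(1, 50):
    # Current algorithm: copy the last time slice, update it, concat it on.
    last = ds.isel(time=[-1]).copy(deep=True)   # Dataset with time length 1
    last["water_temp_c"] = last["water_temp_c"] + 10.0
    last = last.assign_coords(time=[step])
    ds = xr.concat([ds, last], dim="time")      # re-allocates every step
```

The per-step `xr.concat` is what triggers the merge/copy/init calls seen in the cProfile results, since each concatenation builds a brand-new Dataset.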
Next, I make another (lat: 25, lon: 53, time: 50) dataset, where time ranges from 1 past the last time of the existing dataset to 50 past it. All values are initialized as `np.NaN`. I then concat it to our existing dataset once, and iterate over the new indices, writing to each time slice.
Creating the new Dataset and concatenating it takes about 9ms (a constant one-time cost, since it runs once). The test itself
Considerations:
- One must guess how many timesteps to pre-allocate, and until they are written, those slices hold `np.NaN`. This can make the dataset look odd in between timesteps and can make debugging more difficult.
- Additionally, if one runs more timesteps than the guess, there needs to be logic to add in another "buffer" of empty time slices to write to.

If the desired behavior is well constrained, I am happy to implement it, potentially in one shot past my office time. This can be discussed in our meeting this Friday (12/15/2023).
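The pre-allocation approach, including the "buffer" extension logic, might look like this (names and shapes are illustrative, not the notebook's code):

```python
import numpy as np
import xarray as xr

n_lat, n_lon = 25, 53
ds = xr.Dataset(
    {"water_temp_c": (("time", "lat", "lon"), np.zeros((1, n_lat, n_lon)))},
    coords={"time": [0]},
)

def extend_buffer(ds, n_new):
    """Append n_new NaN-filled time slices in a single concat (one allocation)."""
    t0 = int(ds["time"][-1]) + 1
    buf = xr.Dataset(
        {"water_temp_c": (("time", "lat", "lon"),
                          np.full((n_new, n_lat, n_lon), np.nan))},
        coords={"time": np.arange(t0, t0 + n_new)},
    )
    return xr.concat([ds, buf], dim="time")

ds = extend_buffer(ds, 49)      # guess the run length up front
for i in range(1, 50):
    # Write into the pre-allocated slice instead of concatenating each step.
    ds["water_temp_c"][dict(time=i)] = (
        ds["water_temp_c"].isel(time=i - 1).values + 10.0
    )
```

If the run outlasts the guess, `extend_buffer` can simply be called again, which keeps the concat cost to one allocation per buffer rather than one per timestep.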
Spinning off https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494 into a new issue:
Based on @xaviernogueira's Numba vs. no-numba test findings described in https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1856667822:

- Not using `numba` (his `test_no_numba` branch) was always faster than using `numba`! This surprising result is probably due to the built-in performance of the "vectorized" calculations enabled by `numpy`, `xarray`, and `dask-array`.
- The difference is negligible after about 10,000 time steps, so this doesn't deserve a huge effort to refactor away from `numba`.
That said, this does provide evidence that we should slowly start moving away from `numba`, given that it also substantially increases code complexity versus vectorized array calculations.
NOTE: There is still a case for using `numba` when a "vectorized" approach isn't available, but fortunately those cases are likely to be rare due to the extensive development of `numpy` and `xarray`.
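To illustrate the distinction, here is a generic Arrhenius-style temperature correction written both as an element-wise loop (the shape of code one would JIT with numba) and as a single vectorized numpy expression. The formula is a common water-quality idiom used as a stand-in here, not TSM's actual kernel:

```python
import numpy as np

def correct_rates_loop(k20, temp_c):
    # Element-wise loop: the style of code one would decorate with @numba.njit.
    out = np.empty_like(k20)
    for i in range(k20.size):
        out.flat[i] = k20.flat[i] * 1.047 ** (temp_c.flat[i] - 20.0)
    return out

def correct_rates_vectorized(k20, temp_c):
    # Same arithmetic as one numpy expression: no per-element Python
    # overhead and no JIT warm-up cost on the first call.
    return k20 * 1.047 ** (temp_c - 20.0)

k20 = np.full((25, 53), 0.5)
temp = np.full((25, 53), 25.0)
assert np.allclose(correct_rates_loop(k20, temp),
                   correct_rates_vectorized(k20, temp))
```

When the whole calculation can be expressed this way, numpy's compiled kernels do the looping in C, which is why the no-numba branch holds its own.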
Closing this issue, which was about exploring approaches to improving performance. With the results described above, we created the following issues for tracking fixes:
@sjordan29 found TSM to be too slow: it accounted for nearly 80% of run time, versus 20% for ClearWater-riverine transport, when running over a grid of flow fields from HEC-RAS-2D.
Ideas to test:
- Pre-allocating time slices filled with `np.nan` and writing into them may be faster than writing over existing values (current behavior)? It's worth checking, especially if we can pre-init many time steps.
- The full array could be initialized in `Model.__init__`, and then the timestep you want to write onto can be passed as an argument into the run-timestep function. This may also increase performance.

#68
#77
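The `Model.__init__` idea from the list above might be sketched like this (class, attribute, and method names are illustrative, not `clearwater_modules`' actual API):

```python
import numpy as np

class Model:
    """Sketch of the proposed pattern: allocate the full output array once
    in __init__, then write each step into a caller-supplied index."""

    def __init__(self, n_timesteps, grid_shape):
        # One allocation up front; NaN marks not-yet-computed steps.
        self.water_temp_c = np.full((n_timesteps, *grid_shape), np.nan)
        self.water_temp_c[0] = 20.0  # illustrative initial condition

    def run_timestep(self, t):
        # The target timestep index is passed in, so no copy/concat occurs;
        # the step simply writes into its pre-allocated slice.
        self.water_temp_c[t] = self.water_temp_c[t - 1] + 10.0

m = Model(50, (25, 53))
for t in range(1, 50):
    m.run_timestep(t)
```

This combines the two bullets: the pre-allocation happens once at construction, and per-step work reduces to an in-place numpy write.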