@aufdenkampe, I started working on this late last week by creating a profiling script and notebook as a benchmark. My plan for this week was to pre-initialize the model and run the same suite of tests.
These could also serve as a point of comparison for any testing you do with xarray-simlab. I can work on making the profiling script more generalized and user-friendly.
@sjordan29, having a common set of profiling scripts that we use for our various test branches sounds very useful! Thank you!
I just created a new Milestone for tracking these collective efforts: Performance enhancements for NSM demo, and I want to create a number of specific issues for us to each test and characterize with our different approaches.
Pasting https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494 (from @xaviernogueira on Dec. 14) here, to keep discussion on this specific issue in one place:
Strategy of writing to a pre-initialized set of `xarray` time slices

- This was proposed by @aufdenkampe earlier as a potential avenue for performance increases.
- While I have not had time to implement a full version into the existing code for a true apples-to-apples comparison (more on that later), I did build simplified `xarray` workflows in the `compute_engine_testing.ipynb` notebook to gauge whether the approach is worth developing.
- The TL;DR: computation becomes 4-5x faster; however, implementation may involve some tradeoffs, especially when the user hasn't pre-decided how many timesteps will be executed.
Performance Comparison
- I used the tutorial `air_temperature` Dataset from `xarray`, which is a (lat: 25, lon: 53, time: 50) grid.
- The test was adding 50 extra time steps where a mock `water_temp_c` (renamed air temperature) is updated to +10 more than the last timestep.

Test 1: I implemented the current algorithm from `clearwater_modules`: copy the last timestep (making a Dataset with len(time)=1), update its values, then concatenate it onto the end of our existing Dataset. Time: 364ms.
Test 2: Here we make another (lat: 25, lon: 53, time: 50) dataset, where time ranges from 1 past the last time of the existing dataset to 50 past. All values are initialized as `np.NaN`. We then concat it onto our existing dataset and iterate over the new indices, writing to each time slice. Creating the new Dataset and concatenating it takes about 9ms (a one-time cost, since it runs only once); per the TL;DR above, the test itself ran 4-5x faster than Test 1. A minimal sketch contrasting the two approaches follows.
Considerations:
- An array the size of your output needs to be initialized upon startup, which can potentially hog memory for a very long model run, or make workflows where the data is saved to a file at regular intervals more difficult to implement.
- Not a massive issue, but if the user doesn't know how many timesteps they will run a priori, the code will have to make a "guess" and add that many time slices filled with `np.NaN`. This can make the dataset look odd between timesteps and can make debugging more difficult. Additionally, if one runs more timesteps than the guess, there needs to be logic to add another "buffer" of empty time slices to write to (a rough sketch of such a helper follows this list).
- One option around the above issue is to simply require the user to set the number of model iterations up front, but is this desired?
- That said, 4-5x faster is hard to argue against. Implementation shouldn't be too bad.
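As a rough illustration of that buffer-growing logic, here is a hypothetical helper (not code from the branch; the name and signature are illustrative):

```python
import numpy as np
import xarray as xr

def grow_time_buffer(ds: xr.Dataset, n_extra: int, dt: np.timedelta64) -> xr.Dataset:
    """Append n_extra NaN-filled time slices so a run can continue past
    the initial guess of total timesteps. Assumes all data variables are
    float dtypes (NaN cannot be stored in integer arrays)."""
    new_times = ds.time.values[-1] + np.arange(1, n_extra + 1) * dt
    # Repeat the last slice n_extra times as a shape/dtype template,
    # then blank it out and stamp on the new time coordinates.
    template = ds.isel(time=[-1] * n_extra)
    buffer = xr.full_like(template, np.nan).assign_coords(time=new_times)
    return xr.concat([ds, buffer], dim="time")
```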
If the desired behavior is well constrained, I am happy to implement it, potentially past my office time in one shot. This can be discussed in our meeting this Friday (12/15/2023).
@aufdenkampe, some findings (also outlined in `performance_profiling_tsm.ipynb`):
I worked to pre-initialize the model xarray and compare against the baseline model. When we track dynamic variables, we see some interesting behavior. On the graphs, each panel plots time per iteration on the y-axis against grid size on the x-axis, and each panel represents a different number of iterations (e.g., the top panel is 1 iteration, the second panel is 10 iterations, etc.).
For a low number of iterations and 1-1,000 cells, the baseline is faster, but once we get to a large number of cells and timesteps, the baseline computation time takes off while the updated computation time remains steady. I think the pre-initialized model is slower than the baseline up front because of the additional memory overhead required to store an empty array over both space and time for all dynamic variables.
When we stop tracking dynamic variables and compare the two models, we see better outcomes for our updated pre-initialized model. For single-cell models, it is slightly slower (by <0.002 seconds), but once we get to more than one cell, the pre-initialized model is faster. At high numbers of cells and timesteps, the pre-initialized model is more than 10x faster than the baseline model.
I imagine we can get the 1-cell model faster by:
I'd also propose we set the default to not track dynamic variables, so that new users don't incur this computational expense by default (currently, the default is to track dynamic variables).
@sjordan29, those are really interesting results! Thanks for that impressive analysis. Your suggestions for next steps all sound spot-on.
The performance advantages are quite impressive for >10,000 time steps and grids >1,000 cells, which I think is where we need it the most.
It might be interesting to try a 10-cell or 100-cell model, just to get an idea of some intermediate differences.
For the slowdowns with larger numbers of time steps, I'm thinking that the answer might be to periodically write completed time slices out to storage and re-initialize the in-memory array. That way, we're writing the outputs to storage and freeing memory after we've run those time steps.
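One shape that idea could take is a sketch like the following, assuming a Zarr store (the helper name and flush strategy are illustrative, not anything already in the codebase):

```python
import xarray as xr

def flush_to_store(ds: xr.Dataset, store: str, first: bool) -> xr.Dataset:
    """Write all completed time slices except the newest one to a Zarr
    store, then return only that newest slice (still needed to compute
    the next timestep), freeing the rest of the memory."""
    done = ds.isel(time=slice(None, -1))
    if first:
        done.to_zarr(store, mode="w")           # create the store
    else:
        done.to_zarr(store, append_dim="time")  # append along time
    return ds.isel(time=[-1])
```

The run loop would then concat a fresh NaN buffer onto the returned slice every N timesteps and keep going.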
Great -- I'll run those additional tests to fill in the gaps. Then I think I can clean this branch up and merge to main, and we can pivot to some of the memory management issues.
Ran the tests for 10 and 100 timesteps as well. We start seeing the benefits at 100 iterations. It seems like the 1- and 10-timestep runs are more affected by natural variability, since we're not averaging over a large number of timesteps.
@aufdenkampe, a couple questions/notes while I get the tests updated: one is whether to make `time_steps` a required input. We could, optionally, default to a certain number of timesteps to run, but this seems like a waste of memory and could promote inefficient use of the model.

I like both of your suggestions. We could eventually accept `time_steps` (or a time series / coordinate) as input.

Sounds good. I am leaving `time_steps` as an integer in the original implementation to prevent this branch from getting too big, but opened #75 to address this in the future. Hotstart datasets have been addressed as described above, and now all tests are passing.
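For reference, the flexibility deferred to #75 could eventually be a small normalization step like this (hypothetical helper; name and signature are illustrative):

```python
import numpy as np
import pandas as pd

def resolve_time_coord(time_steps, start: np.datetime64, dt: np.timedelta64) -> np.ndarray:
    """Accept either an integer count (current behavior) or an explicit
    time series / coordinate, returning the time values used to
    pre-initialize the output Dataset."""
    if isinstance(time_steps, int):
        return start + np.arange(1, time_steps + 1) * dt
    return pd.DatetimeIndex(time_steps).values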
I will probably also need to update example notebooks and documentation now that I am thinking of it.
Do you want me to work on a PR to main, or should we keep these speed/memory enhancements on a dev branch for now?
@sjordan29, that all sounds awesome. I'm a fan of keeping feature branches short-lived, so merging early and often. That said, if the example notebooks won't run, then it's probably a good idea to fix those first. More extensive docs could come in a second PR.
Pre-initializing an array at the start of a computation is well known to improve performance, because updating values in the array in place is much more CPU- and memory-efficient than resizing or appending to it.
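A bare-NumPy illustration of the same point (the grid shape mirrors the tests above; timings will vary by machine):

```python
import numpy as np

n, shape = 1_000, (25, 53)

# Appending: every concatenate allocates a new array and copies all
# prior data, so total work grows quadratically with n.
out = np.zeros((1, *shape))
for _ in range(n):
    out = np.concatenate([out, out[-1:] + 10.0], axis=0)

# Pre-initialized: one allocation up front, then cheap in-place writes.
out2 = np.empty((n + 1, *shape))
out2[0] = 0.0
for i in range(1, n + 1):
    out2[i] = out2[i - 1] + 10.0
```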
This issue separates out a sub-thread started in issue #57: https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494
In that comment, @xaviernogueira describes the results of a simple test in the `compute_engine_testing.ipynb` notebook (currently in the `57-improve-timestep-performance-further` branch) showing that pre-initialization improved runtime by 4x-5x. See the link above for a discussion of these results.

NOTE that we expect all users to know exactly how many time steps they want to run before starting a simulation, so we do not need to preserve that current flexibility in the modeling system.