@aufdenkampe, I started working on this late last week by creating a profiling script and notebook as a benchmark. My plan for this week was to pre-initialize the model and run the same suite of tests.
These could also serve as a point of comparison for any testing you do with xarray-simlab. I can work on making the profiling script more generalized and user-friendly.
@sjordan29, having a common set of profiling scripts that we use for our various test branches sounds very useful! Thank you!
I just created a new Milestone for tracking these collective efforts: Performance enhancements for NSM demo, and I want to create a number of specific issues for us to each test and characterize with our different approaches.
Pasting https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494 (from @xaviernogueira on Dec. 14) here, to keep discussion on this specific issue in one place:
Strategy of writing to a pre-initialized set of `xarray` time slices

- This was proposed by @aufdenkampe earlier as a potential avenue for performance increases.
- While I have not had time to implement a full version into the existing code for a true apples-to-apples comparison (more on that later), I did build simplified `xarray` workflows in the `compute_engine_testing.ipynb` notebook to gauge whether the approach is worth developing.
- The TL;DR: computation becomes 4-5x faster; however, implementation may involve some tradeoffs, especially when the user hasn't pre-decided how many timesteps will be executed.
Performance Comparison
- I used the tutorial `air_temperature` Dataset from `xarray`, which is a (lat: 25, lon: 53, time: 50) grid.
- The test was adding 50 extra time steps where a mock `water_temp_c` (renamed air temperature) is updated to +10 more than the last timestep.

Test 1: I implemented the current algorithm from `clearwater_modules`: copy the last timestep (making a Dataset with len(time)=1), update its values, then concatenate it onto the end of our existing Dataset. Time: 364ms.
Test 2: Here we make another (lat: 25, lon: 53, time: 50) dataset, where time ranges from 1 past the last time of the existing dataset to 50 past. All values are initialized as `np.NaN`. We then concat it onto our existing dataset and iterate over the new indices, writing to each time slice. Creating the new Dataset and concatenating it takes about 9ms (a one-time cost, since it runs only once); per the TL;DR above, the test itself ran 4-5x faster than Test 1. A minimal sketch contrasting the two approaches follows.
Considerations:
- An array the size of your output needs to be initialized upon startup, which can potentially hog memory for a very long model run, or make workflows where the data is saved to a file at regular intervals more difficult to implement.
- Not a massive issue, but if the user doesn't know how many timesteps they will run a priori, the code will have to make a "guess" and add that many time slices filled with `np.NaN`. This can make the dataset look odd between timesteps and can make debugging more difficult. Additionally, if one runs more timesteps than the guess, there needs to be logic to add another "buffer" of empty time slices to write to (a rough sketch of such a helper follows this list).
- One option around the above issue is to simply require the user to set the number of model iterations up front, but is this desired?
- That said, 4-5x faster is hard to argue against. Implementation shouldn't be too bad.
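As a rough illustration of that buffer-growing logic, here is a hypothetical helper (not code from the branch; the name and signature are illustrative):

```python
import numpy as np
import xarray as xr

def grow_time_buffer(ds: xr.Dataset, n_extra: int, dt: np.timedelta64) -> xr.Dataset:
    """Append n_extra NaN-filled time slices so a run can continue past
    the initial guess of total timesteps. Assumes all data variables are
    float dtypes (NaN cannot be stored in integer arrays)."""
    new_times = ds.time.values[-1] + np.arange(1, n_extra + 1) * dt
    # Repeat the last slice n_extra times as a shape/dtype template,
    # then blank it out and stamp on the new time coordinates.
    template = ds.isel(time=[-1] * n_extra)
    buffer = xr.full_like(template, np.nan).assign_coords(time=new_times)
    return xr.concat([ds, buffer], dim="time")
```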
If the desired behavior is well constrained, I am happy to implement it, potentially past my office time in one shot. This can be discussed in our meeting this Friday (12/15/2023).
@aufdenkampe, some findings (also outlined in `performance_profiling_tsm.ipynb`):
I worked to pre-initialize the model xarray and compare against the baseline model. When we track dynamic variables, we see some interesting behavior. On the graphs, each panel plots time per iteration on the y-axis against grid size on the x-axis, and each panel represents a different number of iterations (e.g., the top panel is 1 iteration, the second panel is 10 iterations, etc.).
For a low number of iterations and 1-1,000 cells, the baseline is faster, but once we get to a large number of cells and timesteps, the baseline computation time takes off while the updated computation time remains steady. I think the pre-initialized model is slower than the baseline up front because of the additional memory overhead required to store an empty array over both space and time for all dynamic variables.
When we stop tracking dynamic variables and compare the two models, we see better outcomes for our updated pre-initialized model. For single-cell models, it is slightly slower (by <0.002 seconds), but once we get to more than one cell, the pre-initialized model is faster. At high numbers of cells and timesteps, the pre-initialized model is more than 10x faster than the baseline model.
I imagine we can get the 1-cell model faster by:
I'd also propose we set the default to not track dynamic variables, so that new users don't incur this computational expense by default (currently, the default is to track dynamic variables).
@sjordan29, those are really interesting results! Thanks for that impressive analysis. Your suggestions for next steps all sound spot-on.
The performance advantages are quite impressive for >10,000 time steps and grids >1,000 cells, which I think is where we need it the most.
It might be interesting to try a 10-cell or 100-cell model, just to get an idea of some intermediate differences.
For the slowdowns with larger numbers of time steps, I'm thinking that the answer might be to periodically write completed time slices out to storage and re-initialize the in-memory array. That way, we're writing the outputs to storage and freeing memory after we've run those time steps.
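One shape that idea could take is a sketch like the following, assuming a Zarr store (the helper name and flush strategy are illustrative, not anything already in the codebase):

```python
import xarray as xr

def flush_to_store(ds: xr.Dataset, store: str, first: bool) -> xr.Dataset:
    """Write all completed time slices except the newest one to a Zarr
    store, then return only that newest slice (still needed to compute
    the next timestep), freeing the rest of the memory."""
    done = ds.isel(time=slice(None, -1))
    if first:
        done.to_zarr(store, mode="w")           # create the store
    else:
        done.to_zarr(store, append_dim="time")  # append along time
    return ds.isel(time=[-1])
```

The run loop would then concat a fresh NaN buffer onto the returned slice every N timesteps and keep going.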
Great -- I'll run those additional tests to fill in the gaps. Then I think I can clean this branch up and merge to main, and we can pivot to some of the memory management issues.
Ran the tests for 10 and 100 timesteps as well. We start seeing the benefits at 100 iterations. It seems like the 1- and 10-timestep runs are more affected by natural variability, since we're not averaging over a large number of timesteps.
@aufdenkampe, a couple questions/notes while I get the tests updated: one is whether to make `time_steps` a required input. We could, optionally, default to a certain number of timesteps to run, but this seems like a waste of memory and could promote inefficient use of the model.

I like both of your suggestions. We could eventually accept `time_steps` (or a time series / coordinate) as input.

Sounds good. I am leaving `time_steps` as an integer in the original implementation to prevent this branch from getting too big, but opened #75 to address this in the future. Hotstart datasets have been addressed as described above, and now all tests are passing.
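For reference, the flexibility deferred to #75 could eventually be a small normalization step like this (hypothetical helper; name and signature are illustrative):

```python
import numpy as np
import pandas as pd

def resolve_time_coord(time_steps, start: np.datetime64, dt: np.timedelta64) -> np.ndarray:
    """Accept either an integer count (current behavior) or an explicit
    time series / coordinate, returning the time values used to
    pre-initialize the output Dataset."""
    if isinstance(time_steps, int):
        return start + np.arange(1, time_steps + 1) * dt
    return pd.DatetimeIndex(time_steps).values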
I will probably also need to update example notebooks and documentation now that I am thinking of it.
Do you want me to work on a PR to main, or should we keep these speed/memory enhancements on a dev branch for now?
@sjordan29, that all sounds awesome. I'm a fan of keeping feature branches short-lived, so merging early and often. That said, if the example notebooks won't run, then it's probably a good idea to fix those first. More extensive docs could come in a second PR.
Pre-initializing an array at the start of a computation is well known to improve performance, because updating values in the array in place is much more CPU- and memory-efficient than resizing or appending to it.
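A bare-NumPy illustration of the same point (the grid shape mirrors the tests above; timings will vary by machine):

```python
import numpy as np

n, shape = 1_000, (25, 53)

# Appending: every concatenate allocates a new array and copies all
# prior data, so total work grows quadratically with n.
out = np.zeros((1, *shape))
for _ in range(n):
    out = np.concatenate([out, out[-1:] + 10.0], axis=0)

# Pre-initialized: one allocation up front, then cheap in-place writes.
out2 = np.empty((n + 1, *shape))
out2[0] = 0.0
for i in range(1, n + 1):
    out2[i] = out2[i - 1] + 10.0
```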
This issue separates out a sub-thread started in issue #57: https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1857094494
In that comment, @xaviernogueira describes the results of a simple test in the `compute_engine_testing.ipynb` notebook (currently in the `57-improve-timestep-performance-further` branch) showing that pre-initialization improved runtime by 4x-5x. See the link above for a discussion of these results.

NOTE that we expect all users to know exactly how many time steps they want to run before starting a simulation, so we do not need to preserve that current flexibility in the modeling system.