E3SM-Project / e3sm_diags

E3SM Diagnostics package
https://e3sm-project.github.io/e3sm_diags
BSD 3-Clause "New" or "Revised" License
42 stars 32 forks source link

CDAT Migration Phase 2: Refactor arm_diags set #842

Closed chengzhuzhang closed 1 month ago

chengzhuzhang commented 3 months ago

Description

Additional Changes

arm_diags_driver.py and arm_diags_viewer.py

derivations.py and formulas.py

arm_diags_plot.py

arm_diags_parameter.py

dataset_xr.py

diurnal_cycle_xr.py

Checklist

If applicable:

chengzhuzhang commented 2 months ago

Updates: All sub sets code refactoring including (annual_cycle, diurnal_cycle(_zt), convection_onset, aerosol_activation) are completed.

Todo: Fix mypy issues Potential performance bottleneck from deriving variables using xarray dataset as input

tomvothecoder commented 2 months ago

Potential performance bottleneck from deriving variables using xarray dataset as input

Thank you for identifying the two formulas code as potential bottlenecks. I confirmed that these computations with Xarray are indeed much slower than CDAT.

Solution

I ran a performance benchmark and found the solution to the slowness: we need to call .load(scheduler="sync") in _get_dataset_with_source_vars() to speed up the computation. I pushed commit ef44fc6 (#842) with this fix.

Benchmark Results

The first runtime is the current code and the second runtime is with .load(). I also ran e3sm_diags with commit ef44fc6 (#842) and confirmed a significant runtime improvement, similar to the benchmarks below. Performance is now on-par with CDAT.

"""
Results
----------
1. Elapsed time (Xarray non-chunked): 6.540755605790764 seconds
2. Elapsed time (Xarray non-chunked with .load()): 0.17097265785560012 seconds
3. Elapsed time (Xarray chunked): 0.1452920027077198 seconds
4. Elapsed time (numpy .values): 6.418793010059744 seconds
5. Elapsed time (numpy .data): 7.334999438840896 seconds
"""
Elapsed time (CDAT main branch, single runtime): 0.12261438369750977 seconds
chengzhuzhang commented 2 months ago

The first runtime is the current code and the second runtime is with .load(). I also ran e3sm_diags with commit ef44fc6 (#842)

@tomvothecoder thank you for performing the timing test. Interesting that when I test your commit. Running through a configuration with each subset, it took 24 mins. Without, .load(), the total time is about 3 mins, which I consider at least on par with cdat code. In this case, maybe we should drop the .load() change?

tomvothecoder commented 2 months ago

The first runtime is the current code and the second runtime is with .load(). I also ran e3sm_diags with commit ef44fc6 (#842)

@tomvothecoder thank you for performing the timing test. Interesting that when I test your commit. Running through a configuration with each subset, it took 24 mins. Without, .load(), the total time is about 3 mins, which I consider at least on par with cdat code. In this case, maybe we should drop the .load() change?

I would think that loading the derived variables dataset into memory shouldn't slow down performance unless the datasets were extremely large (which we should use Dask chunking for).

I only benchmarked performance for the formula computations. I will benchmark a complete run to verify your findings and determine if we should revert the commit or not.

Side-note:

_It could be that the logic I implemented already stores the dataset in-memory, since it merges multiple xr.Dataset objects opened via open_dataset() (uses numpy arrays) instead of using open_mfdataset() (uses Dask arrays). If this the case, I don't see how .load() would improve the speed of the formula computations though._

https://github.com/E3SM-Project/e3sm_diags/blob/ef44fc6ffd538a5e257b097b99f7a1a79b79bc3b/e3sm_diags/driver/utils/dataset_xr.py#L1056-L1064

tomvothecoder commented 2 months ago

RE: https://github.com/E3SM-Project/e3sm_diags/pull/842#issuecomment-2344166562 I found adding .load() in _get_dataset_with_source_vars() adds 2-4 minutes of runtime to a complete arm_diags run.

I reverted this change and will now address the pre-commit issues.

# Commit: 58361c49-4b1b-11ec-9b3b-9c5c8e2f5e4e (no .load())
# run_set function took 281.78 seconds to complete.
# run_set function took 332.79 seconds to complete.
# Commit: ef44fc6ffd538a5e257b097b99f7a1a79b79bc3b (with .load())
# run_set function took 472.81 seconds to complete.
tomvothecoder commented 2 months ago

Hey @chengzhuzhang, I fixed the pre-commit issues in 2c248fb (#842). I also performed initial code cleanup since I would probably be doing that later anyways. Refer to the commit message and my review comments for more information.

I re-ran all sets and they completed successfully with these changes, although I noticed some of the diagnostic sets weren't done yet.

chengzhuzhang commented 2 months ago

Thank you @tomvothecoder I will do another pass to see if there are other sets need to be refactored. Thanks a lot for fixing and clean this branch.

chengzhuzhang commented 2 months ago

@tomvothecoder I applied png regression tests. And all figures are produced and results are as expected as noted in the notebook. Also performance-wise, the refactored codes are similar to the original codes. wall time is ~ 10mins with 4 workers for the full arm_diags set.

tomvothecoder commented 1 month ago

Hey @chengzhuzhang, I noticed you have several other PRs you're working on right now. I'm happy to help finish up refactoring the last diag function in this PR. Let me know.

chengzhuzhang commented 1 month ago

Hey @chengzhuzhang, I noticed you have several other PRs you're working on right now. I'm happy to help finish up refactoring the last diag function in this PR. Let me know.

@tomvothecoder it would be great if you could help me on finish this last diag, so that I'm not holding back the progress to merge!

tomvothecoder commented 1 month ago

Hey @chengzhuzhang, I noticed you have several other PRs you're working on right now. I'm happy to help finish up refactoring the last diag function in this PR. Let me know.

@tomvothecoder it would be great if you could help me on finish this last diag, so that I'm not holding back the progress to merge!

I found that no sets actually run "annual_cycle_aerosol" (arm_diags_model_vs_obs.cfg, arm_diags_model_vs_model.cfg). Unless we expect to run this diagnostic in the future, I think we can delete _run_diag_annual_cycle_aerosol instead of refactoring it.

chengzhuzhang commented 1 month ago

@tomvothecoder good for catching this. I think for now we can just create an issue to log this problem, and we can add back the code at a later time.

tomvothecoder commented 1 month ago

Regression test results

.png regression test shows all plots are identical except the following, which have missing data (white spaces) compared to main:

Comparing:
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/arm_diags/armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags/armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
     * Difference path /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags_diff/armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
Comparing:
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/arm_diags/armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags/armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
     * Difference path /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags_diff/armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
Comparing:
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
     * Difference path /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags_diff/armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
Comparing:
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
     * Difference path /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags_diff/armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
Comparing:
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
    * /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags/armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
     * Difference path /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/667-arm_diags-final/arm_diags_diff/armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
tomvothecoder commented 1 month ago

This PR should be good to go. The integration test image checker fails because there is now one additional plot for the sgpc1 region that was generated with a fix in this PR (related line)

I will merge this PR even though the integration tests fail. I will update the complete test results on LCRC using the cdat-migration-fy24 branch. The complete test results should also include the QBO wavelet plot from #850.