E3SM-Project / e3sm_to_cmip

Tools to CMORize E3SM output
https://e3sm-to-cmip.readthedocs.io/en/latest/
MIT License
7 stars 7 forks source link

Refactor significant portions of the codebase and refactor some land-atm handlers #213

Closed tomvothecoder closed 1 year ago

tomvothecoder commented 1 year ago

Overview

This PR was originally supposed only address #115, but I found the codebase difficult to work with so I decided to refactor it.

Regression Testing

TLDR: All of the datasets produced by this branch align with the master branch. The max relative differences between 5/109 datasets are insignificant and can be attributed to floating point rounding error produced by Xarray vs. CDAT, which is fine.


Setup:

  1. Generated branch datasets using scripts/branch-regression-testing/115-cdat-refactor-test/115-end-to-end-script.sh to compare against master branch datasets.
  2. Compared datasets between branches using scripts/branch-regression/115-comparison-notebook.ipynb

Results:

Summary of Changes

Core Changes

  1. Refactor lib.py (now deleted)
    • Closes #190
    • Replace run_parallel() and run_serial() with __main__.py -> E3SMtoCMIP._run_parallel() and E3SMtoCMIP._run_serial()
      • Updated var name filepaths to vars_to_filepaths for clarity (its a dict mapping var key to list of string paths)
      • Update args passed to handler method, remove unnecessary args
      • Add try and except statement for submitting pool jobs to maintain compatibility with MPAS variable handlers, which use different handler method arguments
    • Replace handle_variables(), get_dimension_data() and load_axis() with VarHandler.cmorize()
    • Delete handle_simple() -- will be re-implemented from scratch in #130
  2. Update handler.py
    • Update VarHandler.cmorize()
      • Refactored significantly -- extracted logic into smaller, maintainable private methods
      • Replaced data dictionary storing xr.DataArrays with ds (xr.Dataset object)
      • Distinguish CMOR usage with "cmor" string in function names and python variables
      • Add support for hybrid sigma variables: pfull, phalf
    • Add VarHandler._cmor_write()
      • Add support for CMORizing fixed-time variables
      • Separate PR for #217
      • If time dimension exists, CMORize the variable with all time and time bound values with a single call to cmor.write() instead of looping over each time value index and CMORizing each slice -- this should improve performance and removes the tqdm progress bar.
  3. Update handlers.yaml
    • Add phalf and pfull entries
      • Closes #115
      • Delete phalf.py and pfull.py
    • Add clcalispo entry
      • Closes #218
      • Delete clcalipso.py
  4. Update _formulas.py
    • Update all function argument types and return types from np.ndarray to xr.DataArray
    • Add formula functions for pfull and phalf
    • Add convert_units() function, which handles 1-to-1 unit conversions -- replaces default.default_handlers.write_data()
  5. Refactor default.py (now deleted)
    • Replaced by VarHandler.cmorize() and _formulas.py.convert_units()

Clean Up Changes

  1. Remove cdms2 and cdutil from dependencies in dev.yml and ci.yml7
  2. Clean up legacy code in clisccp.py
    • Separate PR for #218
  3. Add Makefile for easy access to commonly used commands (e.g., building and installing package)
tomvothecoder commented 1 year ago

Moved the section below out of the description to make the description shorter.

Technical Debt in this Repo

I've noticed many Python anti-patterns and "not good" (bad) coding practices used in this repository. This has resulted in a lot of technical debt for new developers of this repo.

tomvothecoder commented 1 year ago

Hi @chengzhuzhang, this PR is finally ready for review! I performed thorough regression testing and everything checks out.

The PR description includes the core changes that you might want to focus on. Thanks!

tomvothecoder commented 1 year ago

Thanks @chengzhuzhang! I will merge this now.