E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License

Better automate variable derivations in post-processing workflows #605

Open forsyth2 opened 3 weeks ago

forsyth2 commented 3 weeks ago

Issue description

Currently, variable derivations are handled on a per-package basis. For example, the global_time_series task handles derivations in https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/readTS.py, while the e3sm_diags package handles them in https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py.

It would make more sense for derivations to be handled uniformly. Possible options:

  1. Have the model itself derive variables, listing derived variables along with original values in output.
  2. Do the above, but as a separate step before the rest of the post-processing workflow rather than in the model.
  3. Create a package to derive variables as-needed. E.g., if someone requests a derived variable, the e3sm_diags package and the global_time_series zppy task would both call this new package to derive it from the given data.

It's possible a generic package (e.g., a symbolic/computer algebra library) could accomplish (3) without much extra work from us.

forsyth2 commented 3 weeks ago

Since the e3sm_diags package has a thorough derivations section (https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py), we could potentially just move that out into a package that can be called by others.

forsyth2 commented 3 weeks ago

SymPy is a symbolic math library for Python.

forsyth2 commented 2 weeks ago

https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py seems to be composed of roughly the following sections.

L19-619: Functions to convert between variables and/or units, which may be called by multiple other functions. Generally, but not always, the arguments to these functions are variables (of type cdms.TransientVariable, which will be replaced in the CDAT migration effort).

L2163-2550: Similar to the above, but many of these functions update the derived variables dict.

L619-2161 (the derived variables dict) is a dictionary mapping each variable (as a string) to an ordered dictionary, which in turn maps tuples of source variables (as strings) to functions. I'm assuming that, because ordered dictionaries are used, the code goes through the possible substitutions in that prescribed order.
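
As a rough illustration of that structure (not the actual contents of acme.py), the sketch below builds a one-entry version of the dict and applies the first derivation whose inputs are all available. The two PRECT entries mirror the example in the check_for_derived_vars comment; `convert_pr`, `add_precip`, the conversion factor, the `derive` helper, and the use of plain floats as data are all hypothetical stand-ins.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional, Tuple


# Hypothetical stand-ins for the conversion functions in acme.py.
def convert_pr(pr: float) -> float:
    return pr * 86400.0  # assumed conversion: per second -> per day


def add_precip(precc: float, precl: float) -> float:
    return precc + precl


# Target variable -> ordered mapping of (source variables) -> function.
derived_variables: Dict[str, Dict[Tuple[str, ...], Callable[..., float]]] = {
    "PRECT": OrderedDict(
        [
            (("pr",), convert_pr),
            (("PRECC", "PRECL"), add_precip),
        ]
    ),
}


def derive(var: str, data: Dict[str, float]) -> Optional[float]:
    """Apply the first derivation whose source variables are all in `data`."""
    for sources, func in derived_variables.get(var, {}).items():
        if all(name in data for name in sources):
            return func(*(data[name] for name in sources))
    # Not derivable: fall back to the variable itself, if it is present.
    return data.get(var)


# Only PRECC and PRECL are available, so the second entry is used.
print(derive("PRECT", {"PRECC": 1.0, "PRECL": 2.0}))
# pr is available, so the first entry takes priority.
print(derive("PRECT", {"pr": 0.5, "PRECC": 1.0, "PRECL": 2.0}))
```

Because the inner mapping is ordered, the position of an entry acts as its priority, which is the behavior assumed above.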

The logic of deriving variables actually extends further, into check_for_derived_vars in https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/e3sm_diags_vars.py.

This block almost makes it look like we'd need all possible base variables present in the user's file (i.e., there's no filtering on possible_vars):

        if var in derived_variables:
            # Ex: {('PRECC', 'PRECL'): func, ('pr',): func1, ...}.
            vars_to_func_dict = derived_variables[var]
            # Ex: [('pr',), ('PRECC', 'PRECL')].
            possible_vars = vars_to_func_dict.keys()  # type: ignore

            var_added = False
            for list_of_vars in possible_vars:
                if not var_added and vars_in_user_file.issuperset(list_of_vars):
                    # All of the variables (list_of_vars) are in the input file.
                    # These are needed.
                    vars_used.extend(list_of_vars)
                    var_added = True
            # If none of the original vars are in the file, just keep this var.
            # This means that it isn't a derived variable in E3SM.
            if not var_added:
                vars_used.append(var)
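
To check that reading, here is a self-contained sketch of the same selection logic. The function name `vars_needed`, its signature, and the handling of variables with no registered derivation are assumptions made for illustration; only the loop body follows the block above. In this sketch the `issuperset` test does the filtering: a combination is kept only if every one of its variables is in the file, and the loop stops at the first such combination.

```python
from typing import Dict, List, Sequence, Set, Tuple


def vars_needed(
    requested: Sequence[str],
    derived_variables: Dict[str, Dict[Tuple[str, ...], object]],
    vars_in_user_file: Set[str],
) -> List[str]:
    """Return the input-file variables needed to produce `requested`."""
    vars_used: List[str] = []
    for var in requested:
        for sources in derived_variables.get(var, {}):
            # Only a combination that is fully present in the file is kept.
            if vars_in_user_file.issuperset(sources):
                vars_used.extend(sources)
                break
        else:
            # No derivation matched (or none is registered): keep the variable.
            vars_used.append(var)
    return vars_used


# Functions are irrelevant for the selection step, so None is used here.
table = {"PRECT": {("pr",): None, ("PRECC", "PRECL"): None}}

print(vars_needed(["PRECT"], table, {"PRECC", "PRECL", "TS"}))
print(vars_needed(["PRECT"], table, {"pr", "PRECC", "PRECL"}))
print(vars_needed(["PRECT"], table, {"TS"}))
```

So the file only needs one complete combination of base variables, not every variable that appears in possible_vars.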

forsyth2 commented 2 weeks ago

I feel like a recursive approach as in https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/readTS.py would be the cleanest. It would be easier to follow than the derived variable dictionary. However, short of re-implementing the entire derivation code to check, I'm not sure it would fully cover everything.

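from typing import Any, Dict, Optional

# qflxconvert_units and prect are assumed to be the functions of those names in acme.py.
def get_var(var_name: str, defined_vars: Dict[str, Any]) -> Optional[Any]:
    # Base case: the variable is already available.
    if var_name in defined_vars:
        return defined_vars[var_name]
    if var_name == "PRECT":
        # First derivation method: convert the units of pr
        pr = get_var("pr", defined_vars)
        if pr is not None:
            return qflxconvert_units(pr)
        # Try second derivation method
        precc = get_var("PRECC", defined_vars)
        precl = get_var("PRECL", defined_vars)
        if precc is not None and precl is not None:
            return prect(precc, precl)
        # Try third derivation method
        ...
    # Could not define the variable
    return None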
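
One way to avoid a separate branch per variable would be to drive the same recursion from a rule table. The following is a minimal sketch under stated assumptions, not a proposal for the final design: `RULES`, the `Rule` alias, the conversion factor, the `PRECT_WEEKLY` entry (included only to show a derivation that depends on another derived variable), and the use of plain floats as data are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Rule = Tuple[Tuple[str, ...], Callable[..., float]]

# Hypothetical rule table: each target lists its derivations in priority order.
RULES: Dict[str, List[Rule]] = {
    "PRECT": [
        (("pr",), lambda pr: pr * 86400.0),  # assumed unit conversion
        (("PRECC", "PRECL"), lambda precc, precl: precc + precl),
    ],
    # Made-up variable that depends on another derived variable.
    "PRECT_WEEKLY": [(("PRECT",), lambda prect: prect * 7.0)],
}


def get_var(
    name: str,
    data: Dict[str, float],
    rules: Dict[str, List[Rule]] = RULES,
    _active: FrozenSet[str] = frozenset(),
) -> Optional[float]:
    """Return `name` from `data`, or derive it recursively via `rules`."""
    if name in data:
        return data[name]
    if name in _active:
        # Already being resolved further up the call stack: circular rule.
        return None
    for sources, func in rules.get(name, []):
        values = [get_var(s, data, rules, _active | {name}) for s in sources]
        if all(v is not None for v in values):
            return func(*values)
    # Could not define the variable
    return None


# PRECT_WEEKLY needs PRECT, which is itself derived from PRECC and PRECL.
print(get_var("PRECT_WEEKLY", {"PRECC": 1.0, "PRECL": 2.0}))
```

The `_active` set is there so that a pair of rules defined in terms of each other returns None instead of recursing forever. This sketch still says nothing about units or metadata, which the real derivation functions also handle.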

It's possible a third-party symbolic algebra package would be the cleanest solution. I suppose we could try to define the variables as symbols in SymPy and work from there, but we may have too much going on here: names of variables, and also their values and units.

forsyth2 commented 2 weeks ago

@xylar Do you know of any packages or algorithms that would handle something like this well? (This is a lower-priority item; it's just something that has come up a few times now as being potentially useful).

Or maybe option (1)/(2) below would be the better path forward?

  1. Have the model itself derive variables, listing derived variables along with original values in output.
  2. Do the above, but as a separate step before the rest of the post-processing workflow rather than in the model.
  3. Create a package to derive variables as-needed. E.g., if someone requests a derived variable, the e3sm_diags package and the global_time_series zppy task would both call this new package to derive it from the given data.

xylar commented 1 week ago

@forsyth2, thanks for pinging me on this. I don't have any experience with this myself. I haven't tried to allow users to define their own new products and such.