Closed justin-richling closed 7 months ago
@justin-richling and @nusbaume
Following up on an email I sent you about this... I wonder if we should provide an improved approach for derived variables. If we think bringing in numexpr as a new dependency is a reasonable approach, I think I've got a potential drop-in replacement, as follows:
formula
that would define the derived variable in terms of the derivable_from
list.vars_to_derive
variable like this:
# Loop over CAM history variables:
list_of_commands = []
vars_to_derive = {}
# create copy of var list that can be modified for derivable variables
diag_var_list = self.diag_var_list
for var in diag_var_list:
if var not in hist_file_var_list:
vres = res.get(var, {})
if ("derivable_from" in vres) and ("formula" in vres):
constit_list = vres["derivable_from"]
expression = vres["formula"]
for constit in constit_list:
if constit not in diag_var_list:
diag_var_list.append(constit)
vars_to_derive[var] = [constit_list, expression]
continue
else:
msg = f"WARNING: {var} is not in the file {hist_files[0]}."
msg += " No time series will be generated."
print(msg)
continue
derive_variables
with the following (UNTESTED):
import numexpr as nex
def derive(expression, **inputs): """Derive output using expression, with input data suppled in inputs.
expression : str
A string describing the expression to evaluate, can contain
numbers, variables, and basic operators.
inputs : dict
A dictionary with keys being the variables used in expression
and the values being the data (either xr.DataArray or numpy arrays)
return
The result of expression
NOTES
-----
Example: expression = "PRECC + PRECL"
inputs = {"PRECC": arr1, "PRECL": arr2}
"""
return nex.evaluate(expression, inputs)
def derive_xrWrap(expression, *args, **kwargs):
# maybe this isn't kosher, but this allows the user
# to either provide named arguments
# or a dictionary without using **dict
if (len(args) == 1) and (isinstance(args[0], dict)):
kwargs = args[0]
template = kwargs[[*kwargs][0]] # whatever is the first input
# pop the things that are not part of expression inputs
dims = kwargs.pop("dims", (template.dims if hasattr(template, 'dims') else None))
coords = kwargs.pop("coords", (template.coords if hasattr(template, 'coords') else None))
attrs = kwargs.pop("attrs", template.attrs)
ans = derive(expression, kwargs)
return xr.DataArray(ans, dims=dims, coords=coords, attrs=attrs)
def derive_variables(self, vars_to_derive=None, ts_dir=None, overwrite=None): """ Derive variables according to expression and inputs provided in vars_to_derive dict.
Caution: this method _may still_ assume that there will be one time series file per variable
If the file for the derived variable exists, the kwarg `overwrite` determines
whether to overwrite the file (true) or exit with a warning message.
"""
for var in vars_to_derive:
constit_list = vars_to_derive[var][0]
expression = vars_to_derive[var][1]
# get constituent files:
constit_files = {}
for c in constit_list:
if glob.glob(os.path.join(ts_dir, f"*.{c}.*")):
constit_files[c] = sorted(glob.glob(os.path.join(ts_dir, f"*.{c}.*")))
else:
ermsg = f"{c} files were not present; {var} cannot be calculated."
ermsg += f" Please remove {var} from diag_var_list or find the relevant CAM files."
raise FileNotFoundError(ermsg)
# from the constit_files, get the data
constit_data = {}
for c in constit_files:
if len(constit_files[c] > 1):
constit_data[c] = xr.open_mfdataset(constit_files[c])[c]
else:
constit_data[c] = xr.open_dataset(constit_files[c][0])[c]
# output file specification:
# TODO: this might not work for multi-file cases
derived_file = constit_files[[*constit_files][0]].replace(constit_list[0], var)
if Path(derived_file).is_file():
if overwrite:
Path(derived_file).unlink()
else:
print(
f"[{__name__}] Warning: {var} file was found and overwrite is False. Will use existing file."
)
continue
# derive the variable using expression:
result = derive_xrWrap(expression, **constit_data) # defaults to copying metadata from 1st constituent
# TODO: provide a way to send updated metadata to derive_xrWrap, maybe from additional variable_defaults info
# Save output:
if 'time' in result.dims:
udim = 'time'
else:
udim = None
result.to_netcdf(derived_file, unlimited_dims=udim, mode='w')
I thought about this PR some more. I think given my previous suggestions, it should be acceptable to have derived variables that could be derived from derived variables themselves. For example, say PRECT is derived from PRECC + PRECL. Then the convective fraction of precipitation could be derived as PRECCFRAC = PRECC / PRECT. This seems fine, but requires doing the derived variables in a correct order. I believe that we could just reorder the vars_to_derive
using the following code:
def get_derive_order(vres):
"""provides a valid order for deriving variables
such that derived quantities that are dependencies
for other quantities are earlier in the output list
parameters
----------
vres : dict
dict of each derivable variable with values that are the dependencies (as list)
return
------
list
The list of derived variables in a valid order for derivations to succeed
"""
# use deque for efficient inserting/appending
from collections import deque
# Get the full set of values
derivable = set(vres.keys()) # This also serves as our definition of having dependencies
q = deque()
while len(q) < len(derivable): # every derivable needs to be in q
for v in derivable: # Scan through all derivables
if v not in q:
q.append(v)
vindex = q.index(v) # we might need to know this position
if v in vres:
vdepends = vres[v]
if not isinstance(vdepends, list):
vdepends = [vdepends] # in case it's just a single dependency given as a str
vdtest = [vd for vd in vdepends if vd in vres] # list of dependencies with dependencies
if not vdtest:
continue # v in q and no dependencies have dependencies
else:
vdd = [vd for vd in enumerate(vdtest) if vd in q] # true if already in q
if any(vdd):
ndx = vdd.index(True) # index of first True value
q.insert(vindex, vdd[ndx]) # put ahead of v in q
break # leave the for-loop, to re-start the scan
else:
print(f"ERROR: encountered {v} which is not listed as a derived quantity")
return list(q) # convert deque to list
And that could be used right before calling derive_variables
:
if vars_to_derive:
var_order = get_derive_order(vars_to_derive) # already a dict as per my other suggestion
vars_to_derive_ordered = {k: vars_to_derive[k] for k in var_order}
self.derive_variables(vars_to_derive=vars_to_derive_ordered, ts_dir=ts_dir[case_idx])
@justin-richling -- Sorry for all this code, I think I got carried away! If you and/or @nusbaume would prefer to move ahead with your PR, and then I can put my suggested changes in separately, that would be fine with me. Let me know.
@nusbaume I believe I've addressed your and @brianpm's suggestions/changes. Let me know if anything else needs attention, thanks!
This PR will add calculations for zonal aerosol plots, and add new functionality for colorbar options.
Files changed:
adf_variable_defaults.yaml
adf_diag.py
plotting_functions.py
regrid_and_vert_interp.py: