Python front end - encapsulating variables

RobertPincus commented 7 months ago

In thinking about a user-friendly Python front end we think about how well the Fortran design would work.

Specifying problems

Radiation problems are defined in one of three ways - three sets of "optical properties". Using the Fortran notation:

1scl problems are defined in terms of tau
2str problems are defined in terms of tau, ssa, g
nstr problems are defined in terms of tau, ssa, p (we don't currently have much implemented in Fortran)

Each variable might depend on a vertical coordinate (ilay) and/or a spectral coordinate (igpt); p also has a "moment" dimension (imom). The underlying kernels all operate on more than one profile at a time (icol) as the first dimension.

Planets receive radiation from their host star and emit radiation themselves. For the purposes of gas optics we treat these as distinct, so we have two gas optics variants:

_int for internal sources represent planetary radiation and returns layer, level, and surface sources. Each has a spectral dimension; there is one more level than layer, and the surface has no vertical extent.
_ext represents stellar radiation. It returns the solar source (inc_flux) which has a single spectral dimension.
Gas optics calculations also return the set of 1scl, 2str, or nstr optical properties depending on user desires.

Solving problems

The solvers depend on whether the sources are internal (lw for longwave) or external (sw for shortwave). Each takes a set of optical properties (e.g. noscat solvers use 1scl optical properties) and the corresponding set (int/ext) of source functions.

In practice shortwave problems always use a two-stream specification, while longwave solvers might use either a scalar or a two-stream specification.

Encapsulation

In the Fortran interface there is one class grouping together the optical properties and a second grouping together the longwave source functions; the shortwave source is passed around as a plain array. Using classes makes for a smaller number of variables and ensures the right grouping.

Does this approach make sense in Python too?

RobertPincus commented 7 months ago

We discussed this at a group meeting recently.

Conceptually we can think of each problem as having optical properties, sources (of radiation), and boundary conditions. The combination of optical properties and sources defines the kind of problem to be solved; it's also what comes back from the gas optics. Users will specify boundary conditions.

We could distinguish these, e.g.

gas_optics = rrtmgp.GasOptics(SCHEME_DATA)
atmosphere = xr.open_dataset(SOME_FILE) # temperature, pressure, gas concentrations 
optical_props, sources = gas_optics.gas_optics(atmosphere) 
bcs = atmosphere # with agreed names for surface properties, say
flux_up, flux_dn = rte.solve(optical_props, sources, bcs)

This makes everything clear. One clear disadvantage is that the optical_props, sources, and boundary conditions share coordinates in space and along the spectrum. To me that suggests putting them in the same dataset:

gas_optics = rrtmgp.GasOptics(SCHEME_DATA)
atmosphere = xr.open_dataset(SOME_FILE) # temperature, pressure, gas concentrations 
problem = gas_optics.gas_optics(atmosphere) 
problem["surface_emissivity"]  = 1.
flux_up, flux_dn = rte.solve(problem)

This is conceptually less clear and relies (more?) on conventions for arrays within datasets.

brendancol commented 3 months ago

@tcmetzger these are some nice "spelling" examples for us to implement

RobertPincus commented 3 months ago

@tcmetzger @sehnem One thought after discussions with @brendancol earlier today: optical properties and source functions both have a spectral dimension/discretization. In general, there are spectral "bands" that cover particular frequencies/wavelengths (and have starting and ending values of those coordinates); for the gas optics there may also be "g-points" within each band that aren't ordered in any particular way.

This comes up when creating a problem definition/set of optical properties from multiple components in the atmosphere (like gases and clouds) - not only do we have to ensure that the discretizations are the same but we need the ability to broadcasts values defined on bands to each g-point within the band. Like this, if we use a single dataset for the whole problem:

gas_optics = rrtmgp.GasOptics(SCHEME_DATA)
cloud_optics = rrtmgp.CloudOptics(MORE_SCHEME_DATA)

atmosphere = xr.open_dataset(SOME_FILE) # temperature, pressure, gas concentrations 
clouds = xr.open_dataset(SOME_OTHER_FILE) # cloud properties 

problem = gas_optics.gas_optics(atmosphere) 
# "problem" is defined on g-points 
cloud_optics(clouds).add_to(problem)  # or maybe problem += cloud_optics(clouds) 
#cloud_optics(clouds) is defined on bands 
problem["surface_emissivity"]  = 1.

flux_up, flux_dn = rte.solve(problem)

sehnem commented 3 weeks ago

@RobertPincus, I reviewed your suggestion and the current implementation, what do you think of the implementation bellow?

gas_optics = rrtmgp.GasOptics(SCHEME_DATA)
cloud_optics = rrtmgp.CloudOptics(MORE_SCHEME_DATA)

# Here the user can create different atmospheres and clouds to be used in multiple problems
atmosphere = gas_optics.load_atmosphere(SOME_FILE)
clouds = cloud_optics.load_clouds(SOME_OTHER_FILE)

# As the problem is part of RTE, we could create a class there with
# some defaults but that could be modified if needed, it could be 
# configured in the class initialization or by changing the attributes
problem = rte.Problem(atmosphere, clouds) # could be and other name used Problem as was used in the other example
problem.surface_emissivity = 1

# Configs for the parallelization could be added here
flux_up, flux_dn = rte.solve(problem)

I have not worked on the clouds yet, but if there is a high dependency on the atmosphere, we could also do something like

gas_optics = rrtmgp.GasOptics(SCHEME_DATA)
cloud_optics = rrtmgp.CloudOptics(MORE_SCHEME_DATA)

atmosphere = gas_optics.load_atmosphere(SOME_FILE)
clouds = cloud_optics.load_clouds(SOME_OTHER_FILE)
atmosphere.load_clouds(clouds)

problem = rte.Problem(atmosphere)
problem.surface_emissivity = 1

flux_up, flux_dn = rte.solve(problem)

Does it make sense?

@brendancol from a technical point of view, currently we are using xarray accessors, should we keep using it, should we use classes as Robert suggested, or could we use accessors in a layer underneath the classes?

RobertPincus commented 3 weeks ago

@sehnem If I understand right you are proposing a refinement - to make a Problem class associated with the RTE module. Do I have that right? This could be a good choice but we'll need to work some things through.

A complete description of a problem to be solved by RTE contains optical properties, source functions, and boundary conditions, each on a grid with spectral, vertical, and instance dimensions (ngpt, nlay, and ncol in the Fortran code). There are different variables associated with the source functions and boundary conditions for the LW and SW variants.

The cloud (and aerosol) optics code returns optical properties alone, while gas optics returns optical properties and source functions. We can provide default values for boundary conditions.

The normal course of things in the the Fortran is to compute the gas optics and source functions, possibly compute fluxes, compute cloud optics, update the problem definition, then compute fluxes.

Would it be Pythonic to define an rte.Problem class with three variants that encapsulates the optical properties (i.e. allows only certain combinations of arrays within the data set) and subclasses rte.Problem.LW and rte.problem.SW as containers for the combination of optical properties, source functions, and boundary conditions? cloud_optics() would return a Problem but gas_optics() would return a Problem.LW?

sehnem commented 3 weeks ago

The normal course of things in the the Fortran is to compute the gas optics and source functions, possibly compute fluxes, compute cloud optics, update the problem definition, then compute fluxes.

I was missing that, maybe it is better to put this on hold and implement after the cloud optics is at least partially done. I have not worked with that so I was not aware of it.

I will keep working on other fixes with results that are not matching using v0.18 RTE-RRTMGP and latest set of data, now the pyrte result for lw is exactly half of the expected in the data repository, not sure if there was some change but I will debug it.

RobertPincus commented 3 weeks ago

@sehnem With respect to the test data, there was a change in the RTE release in May (https://github.com/earth-system-radiation/rte-rrtmgp/pull/282/files) that might be relevant.

RobertPincus commented 3 weeks ago

@sehnem It may take a bit for the cloud optics to be merged to develop in the Fortran repo - there's an issue with the CI (not relevant for us but still a show-stopper). In terms of the design isn't it enough to know that cloud optics will return a set of optical properties?

sehnem commented 3 weeks ago

@RobertPincus I can work on that as the cloud optics will take some time. It is something that should no be too hard to change if needed, just thought that it may be better to do it as a last step it there was no blocker.

Thanks for pointing the changes from may, some of these functions were reimplemented in python so it is probably it.

RobertPincus commented 4 days ago

@sehnem @tcmetzger A revised spelling for the front end.

# Clear sky example
gas_optics_lw   = rrtmgp.GasOptics(SCHEME_DATA)
gas_optics_sw   = rrtmgp.GasOptics(OTHER_SCHEME_DATA)

atmosphere = xr.open_dataset(SOMEFILE)

# Demonstrating gas optics 

# Longwave examples 

# Adds a problem (tau), sources (lay_source, lev_source), and boundary conditions (surface_emissivity) to 'atmosphere'
gas_optics_lw.compute_gas_optics(atmosphere, problem_type="absorption") 

# Creates a new xr.Dataset with problem (tau), sources (lay_source, lev_source), and boundary conditions (surface_emissivity) 
# with coordinates and dimensions taken from 'atmosphere'
# 'absorption' is the default for LW problems 
lw_problem = gas_optics_lw.compute_gas_optics(atmosphere, problem_type="absorption", add_to_input = False) 

# Creates a new xr.Dataset with problem (tau, ssa, g), sources (lay_source, lev_source), and boundary conditions (surface_emissivity) 
# with coordinates and dimensions taken from 'atmosphere'
lw_problem = gas_optics_lw.compute_gas_optics(atmosphere, problem_type="two-stream", add_to_input = False) 

# Shortwave examples 
# Adds a problem (tau, ssa, g), sources (toa_source), and boundary conditions 
# (surface_albedo_direct, surface_albedo_diffuse, solar_zenith_angle) to 'atmosphere'
# 'two-stream' is the default for SW problems 
gas_optics_sw.compute_gas_optics(atmosphere) 

# Access the problem data: 
lw_problem["tau"] /= 2. 

# Examples of solution

# Appends DataArrays of (broadband) fluxes to the input dataset 
# Default is broadband (spectrally-integrated) fluxes 
# Should the fluxes have names showing SW or LW? 
rte.solve(problem)

flux_up, flux_down = rte.solve(problem, add_to_input = False) 

spec_flux_up, spec_flux_down = rte.solve(problem, add_to_input = False, spectrally_resolved = True) 

# Does this make sense? Or do we use .pipe() 
custom_out = rte.solve(problem, add_to_input = False, reduction = Callable)

sehnem commented 2 days ago

@RobertPincus I created a basic implementation for the LW problem, now I am using xarray objects as return types. The implementation followed your suggestions but I had to change a few things, instead of compute_gas_optics I used the accessor name gas_optics and compute. For RTE it is rte_solve as I am using just a function to select the right method.

Let me know if that looks good, the other attributes that are not in this example should be implemented as you suggested.

rte_rrtmgp_dir = download_rrtmgp_data()
input_dir = os.path.join(rte_rrtmgp_dir, "examples", "rfmip-clear-sky", "inputs")

# This is to load one of the default files that the library automatically downloads, but 
# it also accepts paths do netcdfs. 
# gas_optics_lw is a xarray dataset
gas_optics_lw = load_gas_optics(gas_optics_file=GasOpticsFiles.LW_G256)

atmosphere_path = os.path.join(input_dir, "multiple_input4MIPs_radiation_RFMIP_UColorado-RFMIP-1-2_none.nc")
atmosphere = xr.load_dataset(atmosphere_path).sel(expt=0)

# here it uses some default names for the variables, we need to implement support for custom dataset names
# and gas selections. It support add_to_input or the creation of a new dataset.
gas_optics_lw.gas_optics.compute(atmosphere, problem_type="absorption")

# The output of gas_optics.compute will have an attribute that will describe the problem, 
# LW, SW, absorption, 2-src, etc. And it will be used to call the function.
fluxes = rte_solve(atmosphere, add_to_input=False)

sehnem commented 17 hours ago

@RobertPincus about this last line, the reduction would be a method applied to the output, or something else?

custom_out = rte.solve(problem, add_to_input = False, reduction = Callable)

RobertPincus commented 17 hours ago

@sehnem Yes, that was the idea, but it's inspired by the Fortran where there are fewer options. It might be valuable here if providing a function makes it easier to parallelize the reduction over dask chunks. On the other hand if rte.solve(problem, add_to_input = False).pipe(reduction) has the same effect we wouldn't need to pass the function explicitly.

RobertPincus commented 17 hours ago

@sehnem I hadn't noticed your initial longwave implementation. The revised syntax gas_optics_lw.gas_optics.compute(atmosphere, problem_type="absorption") looks fine to me. Are the gas_optics.compute() and rte_solve() methods both able to append to the input data set and/or return a new xr.Dataset per user request?

sehnem commented 4 hours ago

@RobertPincus yes, both support appending or the return of only the results. I think that this reduction function could work in both ways. I will take a look at the fortran code to get some reference and test what would work the best.

earth-system-radiation / pyRTE-RRTMGP

Python front end - encapsulating variables #31