Thanks @mrudko -- I know it's been some time since you fixed this locally, but I think you first encountered this catalog issue in training an ML model and generating an offline report. Both of these require loading in grid data (e.g. when loading in data via batches here, or loading grid data here). I do not believe the nudged run relies on any of the datasets listed in the catalog.
Regardless, it is indeed a design question we'll need to ponder how to address. Perhaps similar to what is done in the case of computing the prognostic run diagnostics, we could add an option to the command line tools for ML training and the offline report to specify a different reference catalog than the default one packaged with vcm. Adding more options to these tools is maybe not ideal, but this feels like the most explicit way of addressing this issue.
We'll also want to make sure to document what the minimal necessary datasets are -- I think it is just the grid data, land sea mask, and wind rotation matrices at the resolutions one is interested in. That way other users will know which data they will need to locally mirror for these tools to work out of the box.
@oliverwm1 I think once @mrudko addresses #2202 and #2197 this might be a good next task for him. I've put a little more thought into this after my previous comment and include some updated discussion here. Might you have any opinions on how we should go about addressing this?
For the ML training I can think of a couple more targeted options than adding a top-level command line argument, which are probably better:

1. Add an optional attribute to the `BatchesFromMapperConfig` dataclass for the catalog path and propagate that information down into the `add_grid_info` and `add_wind_rotation_info` calls in `batches_from_mapper`.
2. Construct a `grid_data.zarr` store that we could simply output from the nudged run, which included the longitude, latitude, and wind rotation information. This would allow us to specify `needs_grid: False` in our data configs and omit referencing the catalog at all. In principle I like this approach the most, but it's not without a couple snags[^1].

The advantage of these options is that any batch loading in the offline report that requires the grid data would also automatically work in the same way.
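Option 1 could be sketched roughly as below. This is a trimmed-down stand-in, not the real fv3net config: the field name `catalog_path`, the helper `resolve_catalog`, and the default path are all hypothetical, and the actual `BatchesFromMapperConfig` has many more fields.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the reference catalog packaged with vcm (hypothetical path).
DEFAULT_CATALOG = "vcm/catalog.yaml"


@dataclass
class BatchesFromMapperConfig:
    """Trimmed-down sketch; only the proposed optional field is shown."""

    # None means "use the default catalog packaged with vcm"; a string
    # overrides it, e.g. with a locally mirrored catalog on an HPC cluster.
    catalog_path: Optional[str] = None

    def resolve_catalog(self) -> str:
        # This resolved path would be propagated down into the
        # add_grid_info / add_wind_rotation_info calls.
        return self.catalog_path or DEFAULT_CATALOG


default_cfg = BatchesFromMapperConfig()
local_cfg = BatchesFromMapperConfig(
    catalog_path="/home/user/vcm_catalog/catalog.yaml"
)
```

The key design point is that the override lives on the config object rather than the command line, so any workflow constructing batches from this config picks it up automatically.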
For the offline report we may still need to add a command line argument for a new catalog in the compute step, since the grid data is loaded in a couple other places: https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/workflows/diagnostics/fv3net/diagnostics/offline/compute.py#L311 https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/workflows/diagnostics/fv3net/diagnostics/offline/compute.py#L261
[^1]: While the `datasets` option exists in `open_nudge_to_fine`, which should help with this, note that it does not work out of the box for two reasons:
(1) These are fortran diagnostics and therefore are output with dimensions of `"grid_xt"` and `"grid_yt"`. This would be straightforward to rectify with a call to `vcm.fv3.standardize_fv3_diagnostics` before merging the datasets [here](https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/external/loaders/loaders/mappers/_nudged/_nudged.py#L157-L162).
(2) Static datasets output by the fortran model still awkwardly contain a single-valued time coordinate, despite none of the data variables having a time dimension. This interferes with the inner join when merging the datasets together (basically we end up dropping all the data we are interested in). It's not pretty, but one way to hack around this would be to drop any time coordinate from a dataset prior to merging if none of its data variables contain a time dimension.
See for example this dataset:
```
zarrdump gs://vcm-ml-scratch/2023-04-26-lightweight-workflow/2023-04-26/c12-nudged/fv3gfs_run/grid_data.zarr
<xarray.Dataset>
Dimensions: (tile: 6, grid_yt: 12, grid_xt: 12, component: 3, grid_x: 13, grid_y: 13, time: 1)
Coordinates:
* component (component) float64 1.0 2.0 3.0
* grid_x (grid_x) float64 1.0 2.0 3.0 4.0 ... 11.0 12.0 13.0
* grid_xt (grid_xt) float64 1.0 2.0 3.0 4.0 ... 10.0 11.0 12.0
* grid_y (grid_y) float64 1.0 2.0 3.0 4.0 ... 11.0 12.0 13.0
* grid_yt (grid_yt) float64 1.0 2.0 3.0 4.0 ... 10.0 11.0 12.0
* time (time) object 2016-08-05 03:00:00
Dimensions without coordinates: tile
Data variables:
area (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
eastward_wind_u_coeff (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
eastward_wind_v_coeff (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
lat (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
latb (tile, grid_y, grid_x) float32 dask.array<chunksize=(6, 13, 13), meta=np.ndarray>
lon (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
lon_unit_vector (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
lonb (tile, grid_y, grid_x) float32 dask.array<chunksize=(6, 13, 13), meta=np.ndarray>
northward_wind_u_coeff (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
northward_wind_v_coeff (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
x_unit_vector (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
y_unit_vector (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
```
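The workaround described in reason (2) above could be sketched as follows with plain xarray. The helper name `drop_vestigial_time` is made up for illustration; the toy dataset is a stand-in for a static fortran diagnostic like the one dumped above.

```python
import numpy as np
import xarray as xr


def drop_vestigial_time(ds: xr.Dataset) -> xr.Dataset:
    """Drop a time coordinate if no data variable has a time dimension.

    Static fortran outputs can carry a single-valued time coordinate even
    though none of their variables depend on time; that stray coordinate
    breaks inner joins when merging with time-dependent datasets.
    """
    if "time" in ds.coords and not any(
        "time" in ds[name].dims for name in ds.data_vars
    ):
        return ds.drop_vars("time")
    return ds


# Toy static dataset: a scalar time coordinate but no time dimension.
static = xr.Dataset(
    {"area": (("y", "x"), np.ones((2, 2)))},
    coords={"time": np.datetime64("2016-08-05T03:00:00")},
)
cleaned = drop_vestigial_time(static)
```

Applying this to each static dataset prior to merging would keep the inner join from dropping all the time-dependent data.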
I chatted some with @oliverwm1 offline and he brought up the good point that any hacks we needed to include in the `open_nudge_to_fine` function to accommodate static datasets would also potentially need to be propagated into our other loading functions to support the same functionality.
Therefore I think we are leaning towards the first option (adding an optional attribute to the `BatchesFromMapperConfig` dataclass for the catalog path) for ML training. For the offline report he agreed that adding a command-line argument for the catalog path seems like a reasonable solution, mimicking what is done for the online report.
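A minimal sketch of such a command-line argument using argparse; the flag name `--catalog` and its default are assumptions, and the real entry point in `workflows/diagnostics/fv3net/diagnostics/offline/compute.py` has many more arguments.

```python
import argparse

# Hypothetical sketch of the proposed flag for the offline-report compute
# step; None falls back to the catalog packaged with vcm.
parser = argparse.ArgumentParser(description="offline report compute step")
parser.add_argument(
    "--catalog",
    type=str,
    default=None,
    help="Path to a reference catalog; defaults to the catalog "
    "packaged with vcm.",
)

# Example invocation with a locally mirrored catalog.
args = parser.parse_args(["--catalog", "/local/mirror/catalog.yaml"])
```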
Thanks for the write up @spencerkclark! Sounds like a good plan of action to me.
The machine learning (ML) workflow consists of several stages, which involve running the numerical model (FV3GFS) and training an ML model. While performing ML training, I realized that some of the "gs" strings ("gs" = Google Cloud Storage) in the catalog.yaml file https://github.com/ai2cm/fv3net/blob/3b1ff5c9378f6a70c8b760e4f02df373d32de780/external/vcm/vcm/catalog.yaml#L363
need to be replaced with paths to my local home directory, e.g.
urlpath: "/home/mr7417/ML_workflow/c48_fv3config/data/vcm_catalog/c12.zarr/"
In order to remain consistent with the existing structure of the code (i.e. so that the user can run the ML workflow on both the Google Cloud platform and an HPC cluster), some more elaborate (or elegant) code changes need to be introduced.