Replacing "gs" strings to run ML workflow on Stellar cluster #2204

Closed: mrudko closed this issue 1 year ago

mrudko commented 1 year ago

The machine learning (ML) workflow consists of several stages, which involve running the numerical model (FV3GFS) and training an ML model. While performing ML training, I realized that some of the "gs" strings ("gs" referring to Google Cloud Storage) in the catalog.yaml file https://github.com/ai2cm/fv3net/blob/3b1ff5c9378f6a70c8b760e4f02df373d32de780/external/vcm/vcm/catalog.yaml#L363

need to be replaced with paths to my local home directory, e.g.

urlpath: "/home/mr7417/ML_workflow/c48_fv3config/data/vcm_catalog/c12.zarr/"

In order to remain consistent with the existing structure of the code (i.e. the user should be able to run the ML workflow on both the Google Cloud Platform and an HPC cluster), some more elaborate (or elegant) code changes need to be introduced.
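For concreteness, here is a minimal sketch of how a locally mirrored catalog could be opened in place of the one packaged with vcm; the local catalog path and the "grid/c12" entry name are assumptions for illustration, not the actual fv3net interface:

```python
# Hypothetical sketch: open a user-maintained intake catalog whose urlpath
# entries point at local copies of the data instead of gs:// buckets.
import intake

catalog = intake.open_catalog("/home/mr7417/ML_workflow/local_catalog.yaml")
grid = catalog["grid/c12"].to_dask()  # lazily open the mirrored zarr store
```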

spencerkclark commented 1 year ago

Thanks @mrudko -- I know it's been some time since you fixed this locally, but I think you first encountered this catalog issue in training an ML model and generating an offline report. Both of these require loading in grid data (e.g. when loading in data via batches here, or loading grid data here). I do not believe the nudged run relies on any of the datasets listed in the catalog.

Regardless, it is indeed a design question we'll need to figure out how to address. Perhaps, similar to what is done when computing the prognostic run diagnostics, we could add an option to the command-line tools for ML training and the offline report to specify a reference catalog other than the default one packaged with vcm. Adding more options to these tools is maybe not ideal, but this feels like the most explicit way of addressing the issue.
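As a rough illustration of what such an option could look like (the --catalog flag name and the default-catalog import below are assumptions for the sketch, not existing fv3net code):

```python
# Hedged sketch of an optional reference-catalog flag for the ML training
# and offline report command-line tools; --catalog is a hypothetical name.
import argparse

import intake

parser = argparse.ArgumentParser()
parser.add_argument(
    "--catalog",
    default=None,
    help="Path to an intake catalog; if omitted, use the catalog packaged with vcm.",
)
args = parser.parse_args()

if args.catalog is None:
    from vcm.catalog import catalog  # assumed location of the packaged default
else:
    catalog = intake.open_catalog(args.catalog)
```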

We'll also want to make sure to document what the minimal necessary datasets are -- I think it is just the grid data, land-sea mask, and wind rotation matrices at the resolutions of interest. That way other users will know which data they will need to mirror locally for these tools to work out of the box.

spencerkclark commented 1 year ago

@oliverwm1 I think once @mrudko addresses #2202 and #2197 this might be a good next task for him. I've put a little more thought into this after my previous comment and include some updated discussion here. Might you have any opinions on how we should go about addressing this?

ML training

For the ML training I can think of a couple of options more targeted than adding a top-level command-line argument, which are probably better:

1. Add an optional attribute to the BatchesFromMapperConfig dataclass that specifies the path to a reference catalog, defaulting to the one packaged with vcm (a sketch follows below).
2. Bypass the catalog and instead load the grid data from diagnostics output by the run itself, e.g. via the datasets option of open_nudge_to_fine.[^1]

The advantage of these is that any batch loading in the offline report that requires the grid data would also automatically work in the same way.
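A minimal sketch of the first option, assuming a catalog_path field name and an open_catalog helper that fv3net does not actually define:

```python
# Hedged sketch: an optional catalog path on the batches configuration.
# The field and method names here are illustrative assumptions.
import dataclasses
from typing import Optional

import intake


@dataclasses.dataclass
class BatchesFromMapperConfig:
    # ... existing configuration fields elided ...
    catalog_path: Optional[str] = None  # None means "use the vcm default"

    def open_catalog(self):
        if self.catalog_path is None:
            from vcm.catalog import catalog  # assumed default-catalog import

            return catalog
        return intake.open_catalog(self.catalog_path)
```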

Offline report

For the offline report we may still need to add a command-line argument for a new catalog in the compute step, since the grid data is loaded in a couple of other places:

- https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/workflows/diagnostics/fv3net/diagnostics/offline/compute.py#L311
- https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/workflows/diagnostics/fv3net/diagnostics/offline/compute.py#L261

[^1]: While the datasets option exists in open_nudge_to_fine, which should help with this, note that it does not work out of the box for two reasons:

(1) These are fortran diagnostics and therefore are output with dimensions of `"grid_xt"` and `"grid_yt"`.  This would be straightforward to rectify with a call to `vcm.fv3.standardize_fv3_diagnostics` before merging the datasets [here](https://github.com/ai2cm/fv3net/blob/b8146cf7979901e7b78613a298b1973b48ca80a1/external/loaders/loaders/mappers/_nudged/_nudged.py#L157-L162).

(2) Static datasets output by the fortran model still awkwardly contain a single-valued time coordinate, despite none of the data variables having a time dimension.  This interferes with the inner join when merging the datasets together (essentially we end up dropping all of the data we are interested in).  It's not pretty, but one way to hack around this would be to drop the time coordinate from a dataset prior to merging if none of its data variables contain a time dimension (see the sketch after the example dump below).

See for example this dataset:

```
zarrdump gs://vcm-ml-scratch/2023-04-26-lightweight-workflow/2023-04-26/c12-nudged/fv3gfs_run/grid_data.zarr
<xarray.Dataset>
Dimensions:                 (tile: 6, grid_yt: 12, grid_xt: 12, component: 3, grid_x: 13, grid_y: 13, time: 1)
Coordinates:
  * component               (component) float64 1.0 2.0 3.0
  * grid_x                  (grid_x) float64 1.0 2.0 3.0 4.0 ... 11.0 12.0 13.0
  * grid_xt                 (grid_xt) float64 1.0 2.0 3.0 4.0 ... 10.0 11.0 12.0
  * grid_y                  (grid_y) float64 1.0 2.0 3.0 4.0 ... 11.0 12.0 13.0
  * grid_yt                 (grid_yt) float64 1.0 2.0 3.0 4.0 ... 10.0 11.0 12.0
  * time                    (time) object 2016-08-05 03:00:00
Dimensions without coordinates: tile
Data variables:
    area                    (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    eastward_wind_u_coeff   (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    eastward_wind_v_coeff   (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    lat                     (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    latb                    (tile, grid_y, grid_x) float32 dask.array<chunksize=(6, 13, 13), meta=np.ndarray>
    lon                     (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    lon_unit_vector         (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
    lonb                    (tile, grid_y, grid_x) float32 dask.array<chunksize=(6, 13, 13), meta=np.ndarray>
    northward_wind_u_coeff  (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    northward_wind_v_coeff  (tile, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 12, 12), meta=np.ndarray>
    x_unit_vector           (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
    y_unit_vector           (tile, component, grid_yt, grid_xt) float32 dask.array<chunksize=(6, 3, 12, 12), meta=np.ndarray>
```
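A hedged sketch of the workaround described in point (2); the drop_vestigial_time helper name is made up, and the commented merge line stands in for the actual merging logic in _nudged.py:

```python
# Drop a dangling single-valued time coordinate from datasets whose data
# variables carry no time dimension, so that an inner-join merge no longer
# discards the time-dependent data we are interested in.
import xarray as xr


def drop_vestigial_time(ds: xr.Dataset) -> xr.Dataset:
    has_time_dim = any("time" in var.dims for var in ds.data_vars.values())
    if "time" in ds.coords and not has_time_dim:
        ds = ds.drop_vars("time")
    return ds


# Usage: sanitize each static dataset before merging, after first
# standardizing dimension names with vcm.fv3.standardize_fv3_diagnostics
# per point (1):
# merged = xr.merge([drop_vestigial_time(ds) for ds in datasets], join="inner")
```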
spencerkclark commented 1 year ago

I chatted some with @oliverwm1 offline and he brought up the good point that any hacks we needed to include in the open_nudge_to_fine function to accommodate static datasets would also potentially need to be propagated into our other loading functions to support the same functionality.

Therefore I think we are leaning towards the first option (adding an optional attribute to the BatchesFromMapperConfig dataclass for the catalog path) for ML training. For the offline report he agreed that adding a command-line argument for the catalog path seems like a reasonable solution, mimicking what is done for the online report.

oliverwm1 commented 1 year ago

Thanks for the write-up @spencerkclark! Sounds like a good plan of action to me.