Collection with monthly CESM output files (history files)

AJueling commented 4 years ago

We have many different CESM simulations, and I would like to create an intake-esm collection for them. The output files are monthly-mean NetCDF files, each containing many variables. I have created a collection.json file:

{
    "esmcat_version": "0.1.0",
    "id": "CESM_simulations",
    "description": "This is an ESM collection for CESM1 simulations.",
    "catalog_file": "simulations.csv",
    "attributes": [
      { "column_name": "component",  "vocabulary": ""},
      { "column_name": "frequency",  "vocabulary": ""},
      { "column_name": "experiment", "vocabulary": ""},
      { "column_name": "variable",   "vocabulary": ""}
    ],
    "assets": {
      "column_name": "path",
      "format": "netcdf"
    }
}

and with a simulations.csv:

component,frequency,experiment,path
ocn,monthly,CTRL,simulation1.pop.h.0001-01.nc
ocn,monthly,CTRL,simulation1.pop.h.0001-02.nc
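
Note that the collection.json above declares a `variable` attribute, while the CSV header has no `variable` column. A small consistency check can flag this kind of mismatch before the catalog is opened. This is a minimal sketch using only the Python standard library; `missing_columns` is a hypothetical helper and not part of intake-esm, and it assumes the `esmcat_version` 0.1.0 layout shown in this thread.

```python
import csv
import io
import json


def missing_columns(collection, csv_text):
    """Return column names the collection references but the CSV header lacks."""
    header = next(csv.reader(io.StringIO(csv_text)))

    # Collect every column name the collection file refers to.
    referenced = [attr["column_name"] for attr in collection.get("attributes", [])]
    referenced.append(collection["assets"]["column_name"])
    agg = collection.get("aggregation_control", {})
    referenced.extend(agg.get("groupby_attrs", []))
    referenced.extend(a["attribute_name"] for a in agg.get("aggregations", []))

    # Keep first-seen order and drop duplicates.
    missing = []
    for name in referenced:
        if name not in header and name not in missing:
            missing.append(name)
    return missing


collection = json.loads("""
{
  "attributes": [
    {"column_name": "component", "vocabulary": ""},
    {"column_name": "frequency", "vocabulary": ""},
    {"column_name": "experiment", "vocabulary": ""},
    {"column_name": "variable", "vocabulary": ""}
  ],
  "assets": {"column_name": "path", "format": "netcdf"}
}
""")

csv_text = """component,frequency,experiment,path
ocn,monthly,CTRL,simulation1.pop.h.0001-01.nc
ocn,monthly,CTRL,simulation1.pop.h.0001-02.nc
"""

print(missing_columns(collection, csv_text))  # ['variable']
```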

I can create a catalogue with `cat = intake.open_esm_datastore('collection.json').search(experiment=['CTRL'])`, which results in

CESM_simulations-ESM Collection with 2 entries:
    > 1 component(s)
    > 1 frequency(s)
    > 1 experiment(s)
    > 2 path(s)

but when I create a dataset with `dset_dict = cat.to_dataset_dict(cdf_kwargs={'decode_times': False})`, it returns a dataset with only a single time step (`time: 1`) instead of one per file:

resulting xarray dataset

calling `dset_dict['ocn.monthly.CTRL']` yields

```
Dimensions:  (bnds: 2, d2: 2, nlat: 2400, nlon: 3600, time: 1, z_t: 42, z_t_150m: 12, z_w: 42, z_w_bot: 42, z_w_top: 42)
Coordinates:
  * time      (time) float64 7.302e+04
  * z_t       (z_t) float32 500.622 1506.873 ... 562499.9 587499.9
  * z_t_150m  (z_t_150m) float32 500.622 1506.873 ... 14895.824
  * z_w       (z_w) float32 0.0 1001.244 ... 549999.9 574999.9
  * z_w_top   (z_w_top) float32 0.0 1001.244 ... 549999.9 574999.9
  * z_w_bot   (z_w_bot) float32 1001.244 2012.502 ... 599999.9
    ULONG     (nlat, nlon) float64 ...
    ULAT      (nlat, nlon) float64 ...
    TLONG     (nlat, nlon) float64 ...
    TLAT      (nlat, nlon) float64 ...
Dimensions without coordinates: bnds, d2, nlat, nlon
Data variables:
    time_bound          (time, d2) float64 ...
    dz                  (z_t) float32 ...
    dzw                 (z_w) float32 ...
    KMT                 (nlat, nlon) float64 ...
    KMU                 (nlat, nlon) float64 ...
    REGION_MASK         (nlat, nlon) float64 ...
    UAREA               (nlat, nlon) float64 ...
    TAREA               (nlat, nlon) float64 ...
    HU                  (nlat, nlon) float64 ...
    HT                  (nlat, nlon) float64 ...
    DXU                 (nlat, nlon) float64 ...
    DYU                 (nlat, nlon) float64 ...
    DXT                 (nlat, nlon) float64 ...
    DYT                 (nlat, nlon) float64 ...
    HTN                 (nlat, nlon) float64 ...
    HTE                 (nlat, nlon) float64 ...
    HUS                 (nlat, nlon) float64 ...
    HUW                 (nlat, nlon) float64 ...
    ANGLE               (nlat, nlon) float64 ...
    ANGLET              (nlat, nlon) float64 ...
    days_in_norm_year   float64 ...
    grav                float64 ...
    omega               float64 ...
    radius              float64 ...
    cp_sw               float64 ...
    sound               float64 ...
    vonkar              float64 ...
    cp_air              float64 ...
    rho_air             float64 ...
    rho_sw              float64 ...
    rho_fw              float64 ...
    stefan_boltzmann    float64 ...
    latent_heat_vapor   float64 ...
    latent_heat_fusion  float64 ...
    ocn_ref_salinity    float64 ...
    sea_ice_salinity    float64 ...
    T0_Kelvin           float64 ...
    salt_to_ppt         float64 ...
    ppt_to_salt         float64 ...
    mass_to_Sv          float64 ...
    heat_to_PW          float64 ...
    salt_to_Svppt       float64 ...
    salt_to_mmday       float64 ...
    momentum_factor     float64 ...
    hflux_factor        float64 ...
    fwflux_factor       float64 ...
    salinity_factor     float64 ...
    sflux_factor        float64 ...
    nsurface_t          float64 ...
    nsurface_u          float64 ...
    KE                  (time, z_t, nlat, nlon) float32 ...
    TEMP                (time, z_t, nlat, nlon) float32 ...
    SALT                (time, z_t, nlat, nlon) float32 ...
    SSH2                (time, nlat, nlon) float32 ...
    SHF                 (time, nlat, nlon) float32 ...
    SFWF                (time, nlat, nlon) float32 ...
    EVAP_F              (time, nlat, nlon) float32 ...
    PREC_F              (time, nlat, nlon) float32 ...
    SNOW_F              (time, nlat, nlon) float32 ...
    MELT_F              (time, nlat, nlon) float32 ...
    ROFF_F              (time, nlat, nlon) float32 ...
    SALT_F              (time, nlat, nlon) float32 ...
    SENH_F              (time, nlat, nlon) float32 ...
    LWUP_F              (time, nlat, nlon) float32 ...
    LWDN_F              (time, nlat, nlon) float32 ...
    MELTH_F             (time, nlat, nlon) float32 ...
    IAGE                (time, z_t, nlat, nlon) float32 ...
    WVEL                (time, z_w_top, nlat, nlon) float32 ...
    UET                 (time, z_t, nlat, nlon) float32 ...
    VNT                 (time, z_t, nlat, nlon) float32 ...
    UES                 (time, z_t, nlat, nlon) float32 ...
    VNS                 (time, z_t, nlat, nlon) float32 ...
    PD                  (time, z_t, nlat, nlon) float32 ...
    HMXL                (time, nlat, nlon) float32 ...
    XMXL                (time, nlat, nlon) float32 ...
    TMXL                (time, nlat, nlon) float32 ...
    HBLT                (time, nlat, nlon) float32 ...
    XBLT                (time, nlat, nlon) float32 ...
    TBLT                (time, nlat, nlon) float32 ...
    SSH                 (time, nlat, nlon) float64 ...
    time_bnds           (time, bnds) float64 ...
    TAUX                (time, nlat, nlon) float64 ...
    TAUY                (time, nlat, nlon) float64 ...
    UVEL                (time, z_t, nlat, nlon) float64 ...
    VVEL                (time, z_t, nlat, nlon) float64 ...
Attributes:
    title:                      spinup_pd_maxcores_f05_t12
    history:                    Thu Sep 14 23:06:30 2017: ncks -A /projects/0...
    Conventions:                CF-1.0; http://www.cgd.ucar.edu/cms/eaton/net...
    contents:                   Diagnostic and Prognostic Variables
    source:                     CCSM POP2, the CCSM Ocean Component
    revision:                   $Id: tavg.F90 34115 2012-01-25 22:35:19Z njn01 $
    calendar:                   All years have exactly 365 days.
    start_time:                 This dataset was created on 2017-04-15 at 12:...
    cell_methods:               cell_methods = time: mean ==> the variable va...
    nsteps_total:               25052952
    tavg_sum:                   86399.99999999974
    CDI:                        Climate Data Interface version 1.7.0 (http://...
    CDO:                        Climate Data Operators version 1.7.0 (http://...
    NCO:                        "4.6.0"
    history_of_appended_files:  Thu Sep 14 23:06:30 2017: Appended file /proj...
    intake_esm_varname:         None
```

How do I concatenate along the time axis?

andersy005 commented 4 years ago

@AJueling, do you mind if I transfer this issue to the https://github.com/NCAR/intake-esm-datastore repo instead? I plan to comment once it's there.

AJueling commented 4 years ago

Thanks for the quick reply! I don't mind if you move it, of course. (I was not sure where to ask this in the first place.)

andersy005 commented 4 years ago

@AJueling,

Are you working with time-slice files (history files, i.e. one time step per file with many data variables) or time-series files (multiple time steps per file with one data variable)?

As @matt-long pointed out in https://github.com/NCAR/intake-esm/issues/112:

> There is a widespread assumption in intake-esm that there is one variable per file. This precludes using the package with multi-variable files, such as those written directly by CESM.

Unfortunately, this issue of multi-variable files is still unresolved :(

> How do I concatenate along the time axis?

If you were working with time-series (single data variable per file), the following would address the issue:

1. Add a `time_range` column to the CSV that specifies the date range covered by each file.
2. Add an `aggregation_control` section to your collection.json:

{
  "esmcat_version": "0.1.0",
  "id": "CESM_simulations",
  "description": "This is an ESM collection for CESM1 simulations.",
  "catalog_file": "simulations.csv",
  "attributes": [
    {
      "column_name": "component",
      "vocabulary": ""
    },
    {
      "column_name": "frequency",
      "vocabulary": ""
    },
    {
      "column_name": "experiment",
      "vocabulary": ""
    },
    {
      "column_name": "variable",
      "vocabulary": ""
    },
    {
      "column_name": "time_range",
      "vocabulary": ""
    }
  ],
  "assets": {
    "column_name": "path",
    "format": "netcdf"
  },
  "aggregation_control": {
    "variable_column_name": "variable",
    "groupby_attrs": [
      "component",
      "experiment",
      "frequency"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable"
      },
      {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": {
          "dim": "time",
          "coords": "minimal",
          "compat": "override"
        }
      }
    ]
  }
}

For reference, take a look at the collection for CESM2 runs (timeseries): https://github.com/NCAR/intake-esm-datastore/blob/master/catalogs/campaign-cesm2-cmip6-timeseries.json.
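
The first step above (adding a `time_range` column) could be scripted from the file names. The sketch below is only an illustration: it assumes time-series files named `<case>.<model>.h.<VARIABLE>.<start>-<end>.nc`, and the example file names, the `catalog_row` and `write_catalog` helpers, and the fixed `component`/`frequency`/`experiment` values are all hypothetical rather than taken from this thread.

```python
import csv
import io
import re

# Assumed time-series file name layout: <case>.<model>.h.<VARIABLE>.<start>-<end>.nc
PATTERN = re.compile(
    r"^(?P<case>.+)\.(?P<model>[a-z]+)\.h\."
    r"(?P<variable>\w+)\.(?P<time_range>\d+-\d+)\.nc$"
)

COLUMNS = ["component", "frequency", "experiment", "variable", "time_range", "path"]


def catalog_row(path, component, frequency, experiment):
    """Build one catalog row, reading variable and time_range from the file name."""
    name = path.rsplit("/", 1)[-1]
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "component": component,
        "frequency": frequency,
        "experiment": experiment,
        "variable": match["variable"],
        "time_range": match["time_range"],
        "path": path,
    }


def write_catalog(paths, component, frequency, experiment):
    """Return the catalog CSV as text, one row per file, sorted by path."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for path in sorted(paths):
        writer.writerow(catalog_row(path, component, frequency, experiment))
    return buffer.getvalue()


# Hypothetical time-series file names, for illustration only.
example_paths = [
    "simulation1.pop.h.TEMP.001101-002012.nc",
    "simulation1.pop.h.TEMP.000101-001012.nc",
]
csv_text = write_catalog(example_paths, "ocn", "monthly", "CTRL")
print(csv_text)
```

Deriving the columns from the file names keeps the CSV and the files on disk in sync, but the regular expression would need adjusting for any other naming scheme.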

AJueling commented 4 years ago

@andersy005, thank you for the reply. I am indeed working with time-slice files that contain many variables, which is the standard output format of CESM as far as I know. It's good to know that intake-esm does not work for my use case, so I will use a different approach. I suppose we can close this for now, and I will follow @matt-long's issue for any updates.
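
One possible workaround for time-slice files, which is not described in this thread and is offered only as a sketch, is to keep the CSV for bookkeeping and hand the grouped paths directly to `xarray.open_mfdataset`. The helpers `group_paths` and `open_group` are hypothetical names. The sketch assumes the zero-padded `YYYY-MM` stamps in the file names sort chronologically and that the grid variables are identical across files.

```python
import csv
import io
from collections import defaultdict


def group_paths(csv_text, keys=("component", "frequency", "experiment")):
    """Map 'component.frequency.experiment' keys to sorted lists of file paths."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[".".join(row[k] for k in keys)].append(row["path"])
    # Assumption: zero-padded YYYY-MM stamps make lexicographic order chronological.
    return {key: sorted(paths) for key, paths in groups.items()}


def open_group(paths):
    """Concatenate time-slice files along 'time' with xarray (imported lazily)."""
    import xarray as xr

    return xr.open_mfdataset(
        paths,
        combine="nested",      # concatenate in the order the paths are given
        concat_dim="time",
        decode_times=False,
        data_vars="minimal",   # only concatenate variables that have a time dimension
        coords="minimal",
        compat="override",     # take time-independent variables from the first file
    )


csv_text = """component,frequency,experiment,path
ocn,monthly,CTRL,simulation1.pop.h.0001-02.nc
ocn,monthly,CTRL,simulation1.pop.h.0001-01.nc
"""
groups = group_paths(csv_text)
print(groups)

# Needs xarray, dask, and the files on disk:
# ds = open_group(groups["ocn.monthly.CTRL"])
```

The grouping step runs with the standard library alone; only `open_group` requires xarray, so it is left commented out in the usage lines.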

andersy005 commented 4 years ago

It's likely that this issue is of interest to other users, so let's leave it open (as a reference) until multi-variable files are supported.

andersy005 commented 3 years ago

@AJueling, just wanted to let you know that we've been working on functionality for building and using catalogs for CESM runs. Recently, @mgrover1 put together a great blog post with details on how to build a catalog from CESM history files: https://ncar.github.io/esds/posts/ecgtools-history-files-example/

NCAR / intake-esm-datastore

Collection with monthly CESM output files (history files) #55