COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Disambiguate data with the same variable name #137

Open aekiss opened 5 years ago

aekiss commented 5 years ago

When I try

import cosima_cookbook as cc
session = cc.database.create_session()
sea_level = cc.querying.getvar('01deg_jra55v13_iaf','eta_t',session)

I get

ValueError                                Traceback (most recent call last)
<ipython-input-2-777fead40ac4> in <module>
      1 import cosima_cookbook as cc
      2 session = cc.database.create_session()
----> 3 sea_level = cc.querying.getvar('01deg_jra55v13_iaf','eta_t',session)

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/cosima_cookbook/querying.py in getvar(expt, variable, session, ncfile, n, start_time, end_time, chunks, time_units, offset, decode_times, check_present)
     94                            chunks=file_chunks,
     95                            decode_times=False,
---> 96                            preprocess=lambda d: d[variable].to_dataset() if variable not in d.coords else d)
     97 
     98     # handle time offsetting and decoding

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, autoclose, parallel, **kwargs)
    717             data_vars=data_vars, coords=coords,
    718             infer_order_from_coords=infer_order_from_coords,
--> 719             ids=ids)
    720     except ValueError:
    721         for ds in datasets:

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in _auto_combine(datasets, concat_dims, compat, data_vars, coords, infer_order_from_coords, ids)
    551     # Repeatedly concatenate then merge along each dimension
    552     combined = _combine_nd(combined_ids, concat_dims, compat=compat,
--> 553                            data_vars=data_vars, coords=coords)
    554     return combined
    555 

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat)
    473                                                          data_vars=data_vars,
    474                                                          coords=coords,
--> 475                                                          compat=compat)
    476     combined_ds = list(combined_ids.values())[0]
    477     return combined_ds

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in _auto_combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat)
    491         datasets = combined_ids.values()
    492         new_combined_ids[new_id] = _auto_combine_1d(datasets, dim, compat,
--> 493                                                     data_vars, coords)
    494     return new_combined_ids
    495 

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in _auto_combine_1d(datasets, concat_dim, compat, data_vars, coords)
    509         concatenated = [_auto_concat(list(ds_group), dim=dim,
    510                                      data_vars=data_vars, coords=coords)
--> 511                         for id, ds_group in grouped_by_vars]
    512     else:
    513         concatenated = datasets

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in <listcomp>(.0)
    509         concatenated = [_auto_concat(list(ds_group), dim=dim,
    510                                      data_vars=data_vars, coords=coords)
--> 511                         for id, ds_group in grouped_by_vars]
    512     else:
    513         concatenated = datasets

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.04/lib/python3.6/site-packages/xarray/core/combine.py in _auto_concat(datasets, dim, data_vars, coords)
    363                                  'concatenate: %s' % concat_dims)
    364             elif len(concat_dims) == 0:
--> 365                 raise ValueError('cannot infer dimension to concatenate: '
    366                                  'supply the ``concat_dim`` argument '
    367                                  'explicitly')

ValueError: cannot infer dimension to concatenate: supply the ``concat_dim`` argument explicitly

angus-g commented 5 years ago

As a first thought, it might be worth splitting up the querying function so it's easier to follow what's going on. If I run the query manually, I get the following file list:

['output019/ocean/rregionocean_daily.nc',
 'output017/ocean/rregionocean_daily.nc',
 'output020/ocean/rregionocean_daily.nc',
 'output018/ocean/rregionocean_daily.nc',
 'output001/ocean/ocean_daily.nc',
...

This error is because the "rregionocean" files have a length 0 "time" dimension (which also doesn't appear to be a record dimension), so xarray doesn't pick it up as a concatenation axis.
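You can confirm this by opening one of those files directly (path taken from the list above):

import xarray as xr

# Open one regional file on its own (decode_times=False, as getvar does);
# its "time" dimension has length 0, so xarray cannot infer it as the
# concatenation axis.
ds = xr.open_dataset('output019/ocean/rregionocean_daily.nc',
                     decode_times=False)
print(ds.dims['time'])  # 0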

It doesn't actually seem like these files are useful in the first place, so perhaps we shouldn't have even picked them up during indexing? Perhaps it'd be useful to be able to "unindex" specific files. I'm not sure whether this issue should be motivation for restricting queries based on file paths themselves, because I think we're trying to avoid that to abstract away the underlying layout of files, etc.
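For illustration, a hypothetical "unindex" helper might look like the sketch below; the NCFile class and its ncfile column are assumed to match the cookbook's database schema.

from cosima_cookbook import database  # NCFile and its columns are assumed below

def unindex(session, path_fragment):
    # Delete index rows whose path contains the given fragment,
    # e.g. 'rregionocean_daily.nc'.
    (session.query(database.NCFile)
            .filter(database.NCFile.ncfile.contains(path_fragment))
            .delete(synchronize_session=False))
    session.commit()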

AndyHoggANU commented 5 years ago

Aah, OK, I see what is happening. These files are actually useful - they are regional outputs, but they are handled slightly differently in MOM5, which is presumably why the time dimension is having trouble. It might need some investigation as to what information these files have in them...

aidanheerdegen commented 5 years ago

If the same variables exist in different files there needs to be some way to disambiguate them. In some cases they may have different time resolution, so can be differentiated in that way. In this case they have different grids.

Angus and I had a chat about this a while back, in a different context, but I can see now this is an absolute requirement. The dimensions of the variables are in the database, but there needs to be some mechanism for selecting on them.

Thinking more broadly, we also discussed some way of uniquely identifying grids so we could say with confidence that data from different collections were on the same grid. Or, perhaps more usefully, specify that the data returned must match a certain grid, say for comparing a regridded observational product with model output.

aidanheerdegen commented 5 years ago

The old way of disambiguating variables was to specify the filename, e.g. ocean.nc, ocean_month.nc. These names have no informational value, are completely arbitrary, and are unintuitive for anyone not familiar with how the data was saved.

Good data discovery is the best way to address this, but it is also desirable to specify the bare minimum required to get a unique dataset. With this in mind, it would be good if the library detected whether a unique dataset will be returned and, if not, warned the user which additional constraints are required, listing the duplicate values from which they can choose.
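A minimal sketch of such a check (Match is a stand-in for rows returned from the variables table, not the real schema):

from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    # Stand-in for one row of the variables table.
    ncfile: str
    frequency: str

def require_unique(matches, variable):
    # Fail early, listing the constraints the user could add.
    frequencies = {m.frequency for m in matches}
    if len(frequencies) > 1:
        raise ValueError(
            f"'{variable}' matches several datasets; constrain "
            f"frequency to one of {sorted(frequencies, key=str)}")

require_unique([Match('ocean_month.nc', '1 monthly')], 'eta_t')  # passes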

angus-g commented 5 years ago

Since this is still popping up through various confusing errors (concatenation, chunking, etc.), I think it'd be good to fail in a nice way if:

  • there are multiple variables from the variables table selected by the query (e.g. salt from ocean.nc vs. a restart file, where the restart file doesn't have the same long_name or standard_name)
  • dimensions across the selected variables aren't consistent (regional outputs may have e.g. xt_ocean_sub01, or timeseries vs. static)
  • are there any other inconsistencies?

Additionally, and tangentially, a warning if chunking varies across the dataset. This isn't necessarily an error (we'll just pick the chunking from the first file), but it may be much less efficient to do significant on-the-fly rechunking.
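The chunking check could be as simple as this sketch (the per-file chunk dicts are assumed to come from the database):

import warnings

def warn_if_mixed_chunking(per_file_chunks):
    # per_file_chunks: one {dim: chunksize} dict per file in the query.
    distinct = {tuple(sorted(c.items())) for c in per_file_chunks}
    if len(distinct) > 1:
        warnings.warn('chunking varies across files; picking the first '
                      "file's chunking may force on-the-fly rechunking")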

jmunroe commented 5 years ago

I agree: failure with an error message is better than hard-coded heuristics that try to guess the right files.


jmunroe commented 5 years ago

I am not sure if there are edge cases that I am missing, but it looks like it will be enough to use the time frequency and variable name to uniquely identify a variable for a given experiment.

aidanheerdegen commented 5 years ago

For a well-constructed experiment I agree, @jmunroe; however, there is no restriction on saving the same variable at the same frequency in two different files. I'm not advocating that we support that, but you'd definitely want to check it isn't the case. I don't know what xarray would do if you asked it to load two identical datasets in one open_mfdataset call.
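It would be easy enough to test empirically, e.g. (path taken from the file list above, duplicated for illustration):

import xarray as xr

# Open the same file twice; if the data is simply duplicated, the
# combined time index will contain repeated values.
ds = xr.open_mfdataset(['output001/ocean/ocean_daily.nc'] * 2,
                       concat_dim='time', decode_times=False)
print(ds.indexes['time'].is_unique)  # False if duplicated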

jmunroe commented 5 years ago

With the new function cc.querying.get_variables() we can see the source of the problem: the same variable at multiple diagnostic frequencies:

import cosima_cookbook as cc
session = cc.database.create_session()
df = cc.querying.get_variables(session, '01deg_jra55v13_iaf')
df[df.name=='eta_t']
name   frequency  ncfile                                 # ncfiles  time_start           time_end
eta_t  None       output018/ocean/rregionocean_daily.nc  4          None                 None
eta_t  1 daily    output197/ocean/ocean_daily.nc         197        1984-12-30 00:00:00  2017-12-30 00:00:00
eta_t  1 monthly  output016/ocean/ocean_month.nc         16         1984-12-30 00:00:00  1987-12-30 00:00:00

So if we add an optional frequency= parameter to getvar() we can have:

import cosima_cookbook as cc
session = cc.database.create_session()
sea_level = cc.querying.getvar('01deg_jra55v13_iaf', 'eta_t', session,
                               frequency='1 monthly')

then we get

<xarray.DataArray 'eta_t' (time: 36, yt_ocean: 2700, xt_ocean: 3600)>
dask.array<shape=(36, 2700, 3600), dtype=float32, chunksize=(1, 675, 900)>
Coordinates:
  * xt_ocean  (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * time      (time) object 1985-01-14 12:00:00 ... 1987-12-14 12:00:00
Attributes:
    long_name:      surface height on T cells [Boussinesq (volume conserving)...
    units:          meter
    valid_range:    [-1000.  1000.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_t geolat_t

as expected.

To see which diagnostic frequencies are valid, we can use:

cc.querying.get_frequencies(session, experiment='01deg_jra55v13_iaf')
   frequency
0  None
1  1 daily
2  1 monthly
3  static

jmunroe commented 5 years ago

@aidanheerdegen is correct: it is possible to have the same variable, at the same frequency, in multiple files for a given experiment.

Example:

experiment = '025deg_jra55v13_ryf9091_gmredi6'
sea_level = cc.querying.getvar(experiment, 'eta_t', session, frequency='1 monthly')

This experiment has ncfiles of both forms, output%d%d%d/ocean/ocean_month.nc and output%d%d%d/ocean/ocean_snapshot.nc, over the same times, at the same frequency, and both contain the variable eta_t. The call to cc.querying.getvar() appears to succeed, though, which is strange.

aidanheerdegen commented 5 years ago

It is possible to have pretty much anything. It may be that we need to define what we would regard as a well-designed experiment, and that is what the cosima cookbook supports.

Seeing as MOM supports both averaging over a time period and instantaneous values, it may be that those cannot currently be disambiguated.

I also had a suggestion, to "hash" the grid variables in some way so as to uniquely identify them. This would allow regional outputs to be separated from global ones with the same variables and frequency. This would also allow easy inter-experiment comparisons, as it could find all data on a shared grid.

In order to expose grids to users, it would be necessary to give them an identifier that is sufficiently easy to type. This could default to "nx.ny.nz.####", where "####" is the first four digits of the grid hash. The name could be altered for readability and to aid discovery. The grid table could also have an information column with a longer description, again to aid discovery.
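A sketch of how such a hash might be computed (the coordinate names are illustrative, using MOM5 tracer-grid coordinates):

import hashlib

def grid_hash(ds, coords=('xt_ocean', 'yt_ocean')):
    # ds is an xarray.Dataset; hashing the raw coordinate values means
    # identical grids always get identical hashes.
    h = hashlib.sha256()
    for name in coords:
        h.update(ds[name].values.tobytes())
    return h.hexdigest()[:4]  # the "####" in "nx.ny.nz.####"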

aekiss commented 5 years ago

MOM supports not just averages and instantaneous values, but also rms, min, max, diurnal average and raising to a power - see https://github.com/mom-ocean/MOM5/blob/64990e1de853a175335848b67b580363053a79b4/src/shared/diag_manager/diag_table.F90#L150

jmunroe commented 5 years ago

So we should also be tracking the attribute cell_methods: time: mean or cell_methods: time: point to be able to distinguish variables.

For my example with ocean_month.nc and ocean_snapshot.nc, xarray merges them together because the times are interleaved between the two datasets. The cell_methods attribute is stripped out -- perhaps an indicator that something is wrong and that a warning should be given to the user?
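A pre-merge check along these lines could catch it (datasets being the list that getvar is about to combine):

def check_cell_methods(datasets, variable):
    # Refuse to combine averages with snapshots of the same variable.
    methods = {ds[variable].attrs.get('cell_methods') for ds in datasets}
    if len(methods) > 1:
        raise ValueError(
            f"mixed cell_methods for '{variable}': {sorted(methods, key=str)}")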

aidanheerdegen commented 5 years ago

Good point @jmunroe. I agree that cell_methods is required for disambiguation in that case. I guess it needs to be added to the DB @angus-g

AndyHoggANU commented 5 years ago

I agree that cell_methods could be an important disambiguator. [Is that a new word BTW?]

A key thing to remember here is making it easy on the user. We really want to eliminate the ncfile argument for that reason, but if they have to know, in advance, an arcane argument for cell_methods, have we really progressed?

One option here is that, for most variables, all this gets hidden from the user unless they specifically query it. Then, we ask cc.querying.getvar() to check for data ambiguity. If there is any ambiguity, it prompts the user and suggests possible disambiguators. Is this option viable, or worthwhile?

aidanheerdegen commented 4 years ago

I am taking a look at this as I need it for data discovery purposes.