Systematic identification of variables from an xarray Dataset

jthielen commented 6 years ago

Corresponding to #860, it would seem useful to also be able to systematically identify variables from an xarray Dataset. A simple use-case would be something like what motivated this issue, #662, where we want to identify each of the components of the 3D wind field and then do some calculations on those. This also would likely be a prerequisite for #3 (whenever enough pieces are in place for that to be implemented).

An initial approach could be to simply search for the standard_name attribute and adhere strictly to the CF Standard Name list, while giving the user the option to supply a dictionary that fills in standard names where they are missing. However, would there be cases where we don't have a CF standard name for the quantity we want? Or should there be some kind of automatic processing to fill in missing standard_name attributes? Then again, anything much more flexible or complex would likely become even messier than systematic coordinate identification ended up being.
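As a rough sketch of that initial approach (this is not an existing MetPy API; the function name `identify_by_standard_name` and its `fallback_names` parameter are made up for illustration), the lookup could work on any mapping of variable name to an object with an `attrs` dict, which is what `Dataset.data_vars` provides in xarray. Simple stand-in objects are used here so the example runs without a data file:

```python
from types import SimpleNamespace


def identify_by_standard_name(variables, fallback_names=None):
    """Group variable names by their CF standard name.

    variables: mapping of variable name -> object with an ``attrs`` dict
        (assumed to behave like ``dataset.data_vars`` on an xarray Dataset).
    fallback_names: optional mapping of variable name -> standard name,
        consulted only when the ``standard_name`` attribute is missing.
    """
    fallback_names = fallback_names or {}
    found = {}
    for name, var in variables.items():
        standard_name = var.attrs.get('standard_name', fallback_names.get(name))
        if standard_name is not None:
            # Several variables may share one standard name, so keep a list.
            found.setdefault(standard_name, []).append(name)
    return found


# Hypothetical stand-ins for the three components of the 3D wind field.
variables = {
    'u_wind': SimpleNamespace(attrs={'standard_name': 'eastward_wind'}),
    'v_wind': SimpleNamespace(attrs={'standard_name': 'northward_wind'}),
    'w_wind': SimpleNamespace(attrs={}),  # standard_name missing in the file
}
wind = identify_by_standard_name(
    variables, fallback_names={'w_wind': 'upward_air_velocity'})
```

Variables with neither an attribute nor a fallback entry are simply left out of the result, which is one possible answer to the "no standard name available" case, not necessarily the right one.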

dopplershift commented 6 years ago

Is it worth allowing overwriting rather than telling people to do:

my_data.attrs['standard_name'] = 'air_temperature'

jthielen commented 6 years ago

I'm not sure...that definitely seems like a good approach for DataArrays, but since (I'd presume) this is most useful on Datasets, that approach would end up like

my_data['temperature_isobaric'].attrs['standard_name'] = 'air_temperature'
my_data['relative_humidity_isobaric'].attrs['standard_name'] = 'relative_humidity'
my_data['geopotential_height_isobaric'].attrs['standard_name'] = 'geopotential_height'

versus something less verbose such as

data.metpy.parse_cf(variables={'temperature_isobaric': 'air_temperature',
                               'relative_humidity_isobaric': 'relative_humidity',
                               'geopotential_height_isobaric': 'geopotential_height'})

I'd prefer the second, but what do you think?
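To make the comparison concrete, here is a minimal sketch of what the dictionary-based option might do internally. The helper name `fill_standard_names` and its `overwrite` flag are hypothetical (the flag is only there to illustrate the overwriting question above), and the stand-in objects assume nothing more than an `attrs` dict on each variable:

```python
from types import SimpleNamespace


def fill_standard_names(variables, names, overwrite=False):
    """Set ``standard_name`` on each listed variable.

    Returns the names of the variables whose attribute was actually set.
    Existing values are kept unless ``overwrite`` is True.
    """
    changed = []
    for var_name, standard_name in names.items():
        attrs = variables[var_name].attrs
        if overwrite or 'standard_name' not in attrs:
            attrs['standard_name'] = standard_name
            changed.append(var_name)
    return changed


# Hypothetical stand-ins: one variable already carries a standard name.
variables = {
    'temperature_isobaric': SimpleNamespace(attrs={}),
    'relative_humidity_isobaric': SimpleNamespace(
        attrs={'standard_name': 'relative_humidity'}),
}
changed = fill_standard_names(variables, {
    'temperature_isobaric': 'air_temperature',
    'relative_humidity_isobaric': 'relative_humidity',
})
```

With `overwrite=False`, only variables that lack the attribute are touched, so a user-supplied dictionary would fill gaps without silently replacing metadata that came with the file.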

dopplershift commented 6 years ago

Do you have a current set of data where this feature is necessary?

jthielen commented 6 years ago

With regard to systematic identification itself, it's mostly just the motivating example mentioned above at this point, but I could also see it opening up possibilities for future calculations where the user just passes a dataset and the function pulls out what it needs.

With regard to filling in the standard_name, most of the datasets I've been working with would need this, especially since most of the GRIB-converted data I've used from THREDDS servers is missing the standard_name attribute (this includes the NARR and Irma GFS examples in staticdata). Also, no surprise, but non-post-processed WRF output seems to lack it as well.

But, having actually looked into this now and found how common it is for datasets to be missing the standard_name attribute, would programmatic ways of identifying the type of variable be necessary for this to be practical? Or is a different approach, not based on standard_name, needed? (Either way, it seems like something that would take too much effort to work on right now.)

Unidata / MetPy

Systematic identification of variables from an xarray Dataset #886