Make `xarray` datasets discoverable

jasongilman commented 6 years ago

There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud. Currently, however, there is no effective way to discover these datasets.

Using standards like OGC Catalog Service for the Web (CSW) and OpenSearch, it would be possible to discover these xarray datasets via sites like data.gov (and data.gov.uk, data.gov.au, etc.), but this requires producing the ISO metadata that these sites consume.

It would also be possible to discover xarray datasets via sites like Google's dataset search, but it would be necessary to produce the json-ld metadata that these sites consume.

Since xarray preserves the content of datasets which follow the CF and ACDD metadata conventions, it should be possible to generate both types of metadata in a straightforward way from the xarray dataset object, using metadata tools that have already been developed for datasets that adhere to the CF conventions. The ncISO tool already generates ISO records from netCDF files or OPeNDAP endpoints, so its mapping from CF/ACDD attributes to ISO could be reused for records from xarray. Similarly, work has already been done to create nco-json metadata from netCDF files; nco-json is a complete metadata representation from which the json-ld content could be extracted.
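To make the attribute-to-json-ld idea concrete, here is a minimal Python sketch that maps a handful of ACDD global attributes to a schema.org `Dataset` record. The function name `acdd_to_jsonld` and the particular crosswalk are assumptions made for illustration, not an established mapping or an existing xarray API. It works on a plain dict so it runs without xarray installed.

```python
import json


def acdd_to_jsonld(attrs):
    """Map a few ACDD global attributes to a schema.org Dataset dict.

    `attrs` is any mapping of global attributes, e.g. the `.attrs` of an
    xarray Dataset. The crosswalk below is illustrative, not authoritative.
    """
    record = {"@context": "https://schema.org/", "@type": "Dataset"}

    # One-to-one mappings: ACDD attribute name -> schema.org property.
    simple = {"title": "name", "summary": "description", "license": "license"}
    for acdd_key, schema_key in simple.items():
        if acdd_key in attrs:
            record[schema_key] = str(attrs[acdd_key])

    # ACDD keywords are a comma-separated string.
    if "keywords" in attrs:
        record["keywords"] = [
            k.strip() for k in str(attrs["keywords"]).split(",") if k.strip()
        ]

    # ACDD assumes the creator is a person unless creator_type says otherwise.
    if "creator_name" in attrs:
        is_person = str(attrs.get("creator_type", "person")).lower() == "person"
        record["creator"] = {
            "@type": "Person" if is_person else "Organization",
            "name": str(attrs["creator_name"]),
        }

    # Time range as an ISO 8601 interval "start/end".
    if "time_coverage_start" in attrs and "time_coverage_end" in attrs:
        record["temporalCoverage"] = (
            f"{attrs['time_coverage_start']}/{attrs['time_coverage_end']}"
        )

    # Bounding box as "south west north east" (lower corner, then upper corner).
    box_keys = ("geospatial_lat_min", "geospatial_lon_min",
                "geospatial_lat_max", "geospatial_lon_max")
    if all(k in attrs for k in box_keys):
        record["spatialCoverage"] = {
            "@type": "Place",
            "geo": {
                "@type": "GeoShape",
                "box": " ".join(str(attrs[k]) for k in box_keys),
            },
        }
    return record


# Made-up attributes, purely for demonstration.
example_attrs = {
    "title": "Example sea surface temperature",
    "summary": "A made-up dataset used to demonstrate the mapping.",
    "keywords": "ocean, temperature",
    "time_coverage_start": "2000-01-01",
    "time_coverage_end": "2000-12-31",
}
print(json.dumps(acdd_to_jsonld(example_attrs), indent=2))
```

With xarray, the same function could be called as `acdd_to_jsonld(ds.attrs)`. Attributes that are missing from the dataset are simply left out of the record.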

Proposed Work:

Develop code that integrates the nco-json spec into the xarray package, so that the complete metadata of an xarray object can be represented as nco-json.
Develop code that, from the complete nco-json metadata associated with xarray objects, generates the more restrictive ISO and json-ld metadata formats.
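As a rough sketch of the first item, the metadata-only dict that xarray already produces with `Dataset.to_dict(data=False)` could be rearranged into an nco-json-like layout. The function name `to_nco_json_like` is hypothetical, and the output keys (`dimensions`, `variables`, `shape`, `type`, `attributes`) reflect my reading of the layout written by `ncks --json`; they have not been checked against the nco-json spec. The input here is a hand-written dict so the example runs without xarray.

```python
import json


def to_nco_json_like(ds_dict):
    """Rearrange a Dataset.to_dict(data=False)-style dict into an
    nco-json-like layout (assumed layout, not verified against the spec)."""
    variables = {}
    # Coordinates and data variables are both plain variables in netCDF terms.
    for section in ("coords", "data_vars"):
        for name, var in ds_dict.get(section, {}).items():
            variables[name] = {
                # Dimension names, in order.
                "shape": list(var.get("dims", ())),
                # NumPy dtype string; a real implementation would translate
                # this to a netCDF type name such as "float" or "double".
                "type": str(var.get("dtype", "unknown")),
                "attributes": dict(var.get("attrs", {})),
            }
    return {
        "dimensions": dict(ds_dict.get("dims", {})),
        "variables": variables,
        "attributes": dict(ds_dict.get("attrs", {})),
    }


# Hand-written stand-in for ds.to_dict(data=False); values are made up.
example = {
    "dims": {"time": 2},
    "attrs": {"title": "Example dataset"},
    "coords": {
        "time": {"dims": ("time",), "attrs": {"standard_name": "time"},
                 "dtype": "int64", "shape": (2,)},
    },
    "data_vars": {
        "sst": {"dims": ("time",), "attrs": {"units": "K"},
                "dtype": "float32", "shape": (2,)},
    },
}
print(json.dumps(to_nco_json_like(example), indent=2))
```

The second item would then read the ISO and json-ld fields from this single structure instead of from the xarray object directly.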

rabernat commented 6 years ago

> There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud.

Can you give some examples of this?

The ones I know about are the datasets we have put online in zarr format in Pangeo. (Some docs about this process are here: http://pangeo.io/data.html#data-in-the-cloud.) Cataloging these datasets is an open issue (https://github.com/pangeo-data/pangeo/issues/39).

The current problem with hosting xarray data in the cloud is that HDF does not play well with cloud storage. This is a technical obstacle that is being discussed in many places across xarray, zarr, netCDF, etc. That's why I'm curious about your claim that there are already a large number of publicly available cloud datasets that play well with xarray.

All that said, I am supportive of this idea in general.

apawloski commented 6 years ago

We were actually thinking about the Pangeo datasets. The term "large" is subjective, of course; we meant large enough to warrant a catalog, as discussed in https://github.com/pangeo-data/pangeo/issues/39. We experimented with something along these lines a few weeks ago at the Pangeo workshop: https://gist.github.com/rsignell-usgs/88cfae22896bf9fed5bd36a6689e7210. The goal would be to facilitate discovery of these datasets through their attributes/metadata.

ESIPFed / NUMfocusFallDev
