Open jasongilman opened 6 years ago
There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud.
Can you give some examples of this?
The ones I know about are the datasets we have put online in zarr format in Pangeo. (Some docs about this process here: http://pangeo.io/data.html#data-in-the-cloud). Cataloging these datasets is an open issue (https://github.com/pangeo-data/pangeo/issues/39)
The current problem with hosting xarray data in the cloud is that hdf does not play well with cloud storage. This is a technical obstacle that is being discussed in many places across xarray, zarr, netCDF, etc. That's why I'm curious about your claim that there are already a large number of publicly available cloud datasets that play well with xarray.
All that said, I am supportive of this idea in general.
We were actually thinking about the Pangeo datasets. The term "large" is subjective of course, but the collection is large enough to warrant a catalog, as discussed in https://github.com/pangeo-data/pangeo/issues/39. We experimented with something along these lines a few weeks ago at the Pangeo workshop: https://gist.github.com/rsignell-usgs/88cfae22896bf9fed5bd36a6689e7210. The goal would be to facilitate discovery of these datasets through their attributes/metadata.
There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud. Currently, however, there is no effective way to discover these datasets.
Using standards like OGC Catalog Service for the Web (CSW) and OpenSearch, it would be possible to discover these xarray datasets via sites like data.gov (and data.gov.uk, data.gov.au, etc.), but this requires producing the ISO metadata that these sites consume. It would also be possible to discover xarray datasets via sites like Google's dataset search, but it would be necessary to produce the json-ld metadata that these sites consume.
Since xarray preserves the content of datasets which follow the CF and ACDD metadata conventions, it should be possible to generate both types of metadata in a straightforward way from the xarray dataset object, using metadata tools that have already been developed for datasets that adhere to the CF conventions. The ncISO tool already generates ISO records from netCDF or OPeNDAP endpoints, so its mapping from CF/ACDD attributes to ISO could be reused for records from xarray. Similarly, work has already been done to create nco-json metadata from netCDF files, a complete metadata representation from which the json-ld content could be extracted.
Proposed Work:
- Develop code that integrates the nco-json spec into the xarray package, to represent the complete metadata of the xarray object.
- Develop code that, from the complete nco-json metadata associated with xarray objects, generates the more restrictive ISO and json-ld metadata formats.
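To make the two steps above concrete, here is a rough sketch of what such code might look like. It is not an existing xarray feature: the function names dataset_metadata and to_jsonld are hypothetical, the output layout only approximates nco-json, and the mapping from ACDD global attributes to schema.org Dataset properties is an illustrative assumption rather than an agreed crosswalk. The first function only relies on the sizes, variables and attrs members that an xarray.Dataset exposes, so a small stand-in object is used here to keep the example runnable without xarray installed.

```python
import json
from types import SimpleNamespace


def _plain(value):
    """Turn numpy scalars/arrays into plain Python values for JSON output."""
    return value.tolist() if hasattr(value, "tolist") else value


def dataset_metadata(ds):
    """Step 1 (hypothetical helper): collect dimensions, variables, attributes.

    `ds` is assumed to expose `sizes`, `variables` and `attrs` the way an
    xarray.Dataset does. No array data is read. The layout only
    approximates nco-json.
    """
    return {
        "dimensions": {str(k): int(v) for k, v in ds.sizes.items()},
        "variables": {
            str(name): {
                "shape": [str(d) for d in var.dims],
                "type": str(var.dtype),
                "attributes": {k: _plain(v) for k, v in var.attrs.items()},
            }
            for name, var in ds.variables.items()
        },
        "attributes": {k: _plain(v) for k, v in ds.attrs.items()},
    }


def to_jsonld(meta):
    """Step 2 (hypothetical helper): build a schema.org Dataset record.

    Only a handful of ACDD global attributes are mapped; the choice of
    target properties is an assumption made for illustration.
    """
    attrs = meta["attributes"]
    record = {"@context": "https://schema.org/", "@type": "Dataset"}
    for acdd_name, schema_name in [
        ("title", "name"),
        ("summary", "description"),
        ("license", "license"),
    ]:
        if acdd_name in attrs:
            record[schema_name] = attrs[acdd_name]
    if "keywords" in attrs:  # ACDD keywords are a comma-separated string
        record["keywords"] = [k.strip() for k in attrs["keywords"].split(",")]
    if "time_coverage_start" in attrs and "time_coverage_end" in attrs:
        record["temporalCoverage"] = (
            f'{attrs["time_coverage_start"]}/{attrs["time_coverage_end"]}'
        )
    bounds = [
        attrs.get(f"geospatial_{k}")
        for k in ("lat_min", "lon_min", "lat_max", "lon_max")
    ]
    if all(b is not None for b in bounds):
        record["spatialCoverage"] = {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": " ".join(str(b) for b in bounds)},
        }
    return record


# Stand-in for an xarray.Dataset; all values below are made up.
tas = SimpleNamespace(
    dims=("time", "lat", "lon"),
    dtype="float32",
    attrs={"units": "K", "standard_name": "air_temperature"},
)
ds = SimpleNamespace(
    sizes={"time": 12, "lat": 90, "lon": 180},
    variables={"tas": tas},
    attrs={
        "title": "Demo surface air temperature",
        "summary": "Synthetic example used to illustrate the mapping.",
        "keywords": "temperature, climate",
        "time_coverage_start": "2000-01-01",
        "time_coverage_end": "2000-12-31",
        "geospatial_lat_min": -90.0,
        "geospatial_lat_max": 90.0,
        "geospatial_lon_min": -180.0,
        "geospatial_lon_max": 180.0,
    },
)

meta = dataset_metadata(ds)
record = to_jsonld(meta)
print(json.dumps(record, indent=2))
```

Keeping the complete metadata dictionary as an intermediate step means an ISO generator could later consume the same structure, which is the reuse the second work item has in mind.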