**Open** · PeterDSteinberg opened this issue 7 years ago
> collections

We discussed that `xarray.Dataset` may be the best data structure for `collections`, e.g. for spatial regridding or an analysis that requires an elevation model and a rainfall raster. This is one of the reasons that `xarray.Dataset` is a good standardization. In the example of a filter function that needs elevation and rainfall data, the function could check the `Dataset` for expected keys (`DataArray`s) related to the required data and error out with a useful message if necessary data are missing.

Some comments:
If you can find a better word, go for it; nobody is happy with `feature`. Typically a lot of services use: site, site code, monitoring station, location, etc. These work reasonably well for data at a particular point but are somewhat misleading in other cases. The OGC SOS 2.0 spec has a concept of 'Feature of Interest', which is where the terminology 'feature' came from (see http://www.ogcnetwork.net/sos_2_0/tutorial/om).
This can be a bit tricky. For example, SRTM data can be downloaded from several locations, but the original provider is NASA. In other efforts we have used 'provider' to mean a combination of the source organization and a particular method of access (e.g. the NOAA Coastwatch Tabledap service). The 'services' are then distinct dataset services within that main service. This isn't the only way to do it, but it seems reasonable from an implementation standpoint.
Parameters: Some questions to be addressed:
Metadata: One of the ongoing issues we have had is how to store and maintain metadata in pandas. We have some workarounds where our open/save routines push/pull the metadata to attributes in the HDF5 file, and we then explicitly modify and move metadata to newly created datasets when we run transformation filters on them.
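A minimal sketch of that workaround pattern, assuming plain pandas (the filter helper, metadata keys, and column names here are all hypothetical, not an existing API): metadata travels alongside the DataFrame and is copied explicitly whenever a transformation produces a new dataset.

```python
import pandas as pd

def run_filter(df, meta, func):
    """Apply a transformation filter and explicitly propagate metadata,
    since pandas does not carry arbitrary metadata through operations."""
    out = func(df)
    return out, dict(meta)  # copy so each new dataset owns its metadata

rainfall = pd.DataFrame({"rainfall_mm": [0.0, 2.5, 1.1]})
meta = {"units": "mm", "provider": "example-provider"}

# The new dataset keeps the metadata without mutating the original's.
smoothed, smoothed_meta = run_filter(rainfall, meta, lambda d: d.rolling(2).mean())
```

The same push/pull step would sit in the open/save routines that read and write the HDF5 attributes.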
> providers

Maybe distinguish `provider` (where we get the data) from `source` (or `creator`, i.e., where the data originally came from)?
I'll add here an additional requirement of this data catalogue service, and we may make separate issues for it over time: we need to avoid accidental massive downloads. For example, I plan to download what I think is about 100 GB of data, but I misconfigure the bounding box of the query in space/time and attempt to download 10 TB. Consider config/CLI/env settings that control one or more of:
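The `Dataset` key-checking idea discussed above (a filter validating required `DataArray`s before running) could look roughly like the following sketch; the filter name, variable names, and the placeholder computation are all illustrative, not part of any existing API:

```python
import numpy as np
import xarray as xr

REQUIRED = ("elevation", "rainfall")

def effective_rainfall(ds: xr.Dataset) -> xr.DataArray:
    """Illustrative filter: validate required DataArrays up front and
    error out with a useful message before doing any work."""
    missing = [name for name in REQUIRED if name not in ds.data_vars]
    if missing:
        raise KeyError(f"Dataset is missing required DataArrays: {missing}")
    # Placeholder computation standing in for a real analysis step:
    return ds["rainfall"] - 0.001 * ds["elevation"]

ds = xr.Dataset({
    "elevation": ("x", np.array([100.0, 250.0])),
    "rainfall": ("x", np.array([5.0, 3.0])),
})
result = effective_rainfall(ds)
```

Because both variables live in one `Dataset`, the check is a single membership test rather than bookkeeping across separate arrays.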
`earthio` should be a data catalogue (see also the wiki notes from meeting yesterday): the `elm-main` yaml spec approach for ML (`elm-main` is temporarily deprecated during refactoring).

- `features`: Generally these are spatial entities where data are reported, such as a lake's name, a state name, or another polygon, line, or point. This could also include conceptual geography terms like "hot desert" for data that has been generalized to most hot deserts. I think we may want to consider changing the name `features` for this concept, as `features` has several meanings, e.g. feature engineering in ML, features of software, or geomorphic feature identification. Suggestions on an alternate name for the idea, or should we leave it how it is? ( @dharhas @jbednar @philippjfr @gbrener )
- `providers`: A provider may be an organization, such as NASA or NOAA.
- `services`: `services` are the distinct data sources of a `provider`, such as a specific NASA dataset with related URLs.
- `parameters`: These are the names and related metadata for measurements, e.g. daily average temperature or hourly rainfall sums. There are a few considerations for each `parameter`, such as:
- `collections`: To do an analysis, one might have one or more `collections`, where a `collection` is a group of `parameter`/`feature` combinations and the related metadata, `service` info, and options on how to save the data locally.

Handle the following data downloading concerns:
- Confirming that a `service`/`feature` combination actually has the data expected. For example, the meta-level information may indicate USGS has water flow information for 1995 to 2010 for a given river station, but after downloading a number of CSVs it is apparent that there is only data for 1995 and 2010 with a long data gap. It is inefficient to download all the data files just to find that data are missing. Try to plan metadata collection so that we can avoid such extra downloads where feasible.
- Local data (already part of a `collection`): data that has been downloaded or acquired elsewhere.
- `collections` should be essentially standalone: the metadata, data files, and their organization should allow usage without the user understanding all the details of how the `collection` was downloaded.
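One way to address the accidental-massive-download concern raised earlier in the thread: a pre-download guard that estimates the payload from catalogue metadata and refuses to proceed past a configurable cap. A minimal sketch, where the function name, env var, and default limit are all hypothetical:

```python
import os

DEFAULT_MAX_BYTES = 100 * 1024**3  # hypothetical default cap of 100 GiB

def check_download_size(estimated_bytes: int) -> None:
    """Raise before any download starts if the estimated payload exceeds
    the cap; the cap could come from a config/CLI/env setting."""
    cap = int(os.environ.get("MAX_DOWNLOAD_BYTES", DEFAULT_MAX_BYTES))
    if estimated_bytes > cap:
        raise RuntimeError(
            f"Query would download ~{estimated_bytes / 1024**3:.1f} GiB, "
            f"over the {cap / 1024**3:.1f} GiB cap; check the bounding box "
            "or raise MAX_DOWNLOAD_BYTES."
        )

check_download_size(5 * 1024**3)  # a 5 GiB estimate passes under the default cap
```

Failing fast on the size estimate catches a misconfigured bounding box before any files are fetched, which also fits the metadata-planning concern above.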