**Open** · PeterDSteinberg opened this issue 7 years ago
> collections

We discussed that `xarray.Dataset` may be the best data structure for `collections`, e.g. for spatial regridding or an analysis that requires an elevation model and a rainfall raster. This is one of the reasons that `xarray.Dataset` is a good standardization. In the example of a filter function that needs elevation and rainfall data, the function could check the `Dataset` for expected keys (`DataArray`s) related to the required data and error out with a useful message if necessary data are missing.

Some comments:
If you can find a better word, go for it; nobody is happy with `feature`. Typically a lot of services use: site, site code, monitoring station, location, etc. These work reasonably well for data at a particular point but are somewhat misleading in other cases. The OGC SOS 2.0 spec has a concept of 'Feature of Interest', which is where the terminology 'feature' came from (see http://www.ogcnetwork.net/sos_2_0/tutorial/om).
This can be a bit tricky. For example, SRTM data can be downloaded from several locations, but the original provider is NASA. In other efforts we have used 'provider' to mean a combination of the source organization and a particular method of access (e.g. the NOAA Coastwatch Tabledap service). The 'services' are then distinct dataset services within that main service. This isn't the only way to do it, but it seems reasonable from an implementation standpoint.
Parameters: Some questions to be addressed:
Metadata: One of the ongoing issues we have had is how to store and maintain metadata in pandas. We have some workarounds where our open/save routines push/pull the metadata to attributes in the HDF5 file, and we then explicitly modify and move metadata to newly created datasets when we run transformation filters on them.
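A minimal sketch of that workaround pattern, assuming plain pandas (the filter helper, metadata keys, and column names here are all hypothetical, not an existing API): metadata travels alongside the DataFrame and is copied explicitly whenever a transformation produces a new dataset.

```python
import pandas as pd

def run_filter(df, meta, func):
    """Apply a transformation filter and explicitly propagate metadata,
    since pandas does not carry arbitrary metadata through operations."""
    out = func(df)
    return out, dict(meta)  # copy so each new dataset owns its metadata

rainfall = pd.DataFrame({"rainfall_mm": [0.0, 2.5, 1.1]})
meta = {"units": "mm", "provider": "example-provider"}

# The new dataset keeps the metadata without mutating the original's.
smoothed, smoothed_meta = run_filter(rainfall, meta, lambda d: d.rolling(2).mean())
```

The same push/pull step would sit in the open/save routines that read and write the HDF5 attributes.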
> providers

Maybe distinguish `provider` (where we get the data) from `source` (or `creator`, i.e., where the data originally came from)?
I'll add here an additional requirement of this data catalogue service, and we may make separate issues for it over time: we need to avoid accidental massive downloads. For example, I plan to download what I think is about 100 GB of data, but I misconfigure the bounding box of the query in space/time and attempt to download 10 TB. Consider config/CLI/env settings that control one or more of:
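The `Dataset` key-checking idea discussed above (a filter validating required `DataArray`s before running) could look roughly like the following sketch; the filter name, variable names, and the placeholder computation are all illustrative, not part of any existing API:

```python
import numpy as np
import xarray as xr

REQUIRED = ("elevation", "rainfall")

def effective_rainfall(ds: xr.Dataset) -> xr.DataArray:
    """Illustrative filter: validate required DataArrays up front and
    error out with a useful message before doing any work."""
    missing = [name for name in REQUIRED if name not in ds.data_vars]
    if missing:
        raise KeyError(f"Dataset is missing required DataArrays: {missing}")
    # Placeholder computation standing in for a real analysis step:
    return ds["rainfall"] - 0.001 * ds["elevation"]

ds = xr.Dataset({
    "elevation": ("x", np.array([100.0, 250.0])),
    "rainfall": ("x", np.array([5.0, 3.0])),
})
result = effective_rainfall(ds)
```

Because both variables live in one `Dataset`, the check is a single membership test rather than bookkeeping across separate arrays.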
`earthio` should be a data catalogue (see also the wiki notes from meeting yesterday): the `elm-main` yaml spec approach for ML (`elm-main` is temporarily deprecated during refactoring).

- `features`: Generally these are spatial entities where data are reported, such as a lake's name, a state name, or another polygon, line, or point. This could also include conceptual geography terms like "hot desert" for data that has been generalized to most hot deserts. I think we may want to consider changing the name `features` for this concept, as `features` has several meanings, e.g. feature engineering in ML, features of software, or geomorphic feature identification. Suggestions on an alternate name for the idea, or should we leave it how it is? ( @dharhas @jbednar @philippjfr @gbrener )
- `providers`: A provider may be an organization, such as NASA or NOAA.
- `services`: `services` are the distinct data sources of a `provider`, such as a specific NASA dataset with related URLs.
- `parameters`: These are the names and related metadata for measurements, e.g. daily average temperature or hourly rainfall sums. There are a few considerations for each `parameter`, such as:
- `collections`: To do an analysis, one might have one or more `collections`, where a `collection` is a group of `parameter`/`feature` combinations and the related metadata, `service` info, and options on how to save the data locally.

Handle the following data downloading concerns:
- Confirming that a `service`/`feature` combination actually has the data expected. For example, the meta-level information may indicate USGS has water flow information for 1995 to 2010 for a given river station, but after downloading a number of CSVs it is apparent that there is only data for 1995 and 2010 with a long data gap. It is inefficient to download all the data files just to find that data are missing. Try to plan metadata collection so that we can avoid such extra downloads where feasible.
- Local data (already part of a `collection`): data that has been downloaded or acquired elsewhere.
- `collections` should be essentially standalone: the metadata, data files, and their organization should allow usage without the user understanding all the details of how the `collection` was downloaded.
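One way to address the accidental-massive-download concern raised earlier in the thread: a pre-download guard that estimates the payload from catalogue metadata and refuses to proceed past a configurable cap. A minimal sketch, where the function name, env var, and default limit are all hypothetical:

```python
import os

DEFAULT_MAX_BYTES = 100 * 1024**3  # hypothetical default cap of 100 GiB

def check_download_size(estimated_bytes: int) -> None:
    """Raise before any download starts if the estimated payload exceeds
    the cap; the cap could come from a config/CLI/env setting."""
    cap = int(os.environ.get("MAX_DOWNLOAD_BYTES", DEFAULT_MAX_BYTES))
    if estimated_bytes > cap:
        raise RuntimeError(
            f"Query would download ~{estimated_bytes / 1024**3:.1f} GiB, "
            f"over the {cap / 1024**3:.1f} GiB cap; check the bounding box "
            "or raise MAX_DOWNLOAD_BYTES."
        )

check_download_size(5 * 1024**3)  # a 5 GiB estimate passes under the default cap
```

Failing fast on the size estimate catches a misconfigured bounding box before any files are fetched, which also fits the metadata-planning concern above.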