USGS Time Series and data pipeline integrity with sensorQC

USGS has time series data on water quality. The data comes from satellites, sensors all over the United States, and other kinds of sources, at various time intervals and in different formats. It is growing fast as more technology is deployed. For example, dissolved oxygen data covered 168 sites in 2008 and 560 sites in 2014.

'Open data'

They have a couple of web applications that serve to 'open' the data through REST APIs and visual filtering through web forms: http://waterqualitydata.us and http://nwis.waterdata.usgs.gov/nwis/qwdata. Here is an example query to get some time series data on California groundwater.

They understand that this isn't necessarily the best way to open data, particularly for developers and data scientists. To make it easier, they have a variety of R wrappers for these data retrieval web services.

Data pipeline

Data starts as preliminary (level 0, primary source data). Once it gets cleaned up, it's no longer preliminary: it gets 'blessed' by a water science technician. Often this is a manual, visual QAQC procedure.

When changes are made (going from level 0 to level 1), they are not publicly exposed, so it can be difficult to track what was changed.
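To make the tracking problem concrete, here is a minimal sketch of a change log between a level 0 and a level 1 series. The layout (a mapping from timestamp string to reading), the function name `change_log`, and the sample values are all assumptions made for illustration; this is not how USGS stores or compares its data.

```python
def change_log(level0, level1):
    """List the edits that turn the raw (level 0) series into the cleaned (level 1) one.

    Both arguments map a timestamp string to a reading (hypothetical layout).
    A reading that exists on only one side shows up as None on the other.
    """
    changes = []
    for timestamp in sorted(set(level0) | set(level1)):
        before = level0.get(timestamp)
        after = level1.get(timestamp)
        if before != after:
            changes.append((timestamp, before, after))
    return changes


# Made-up readings: the technician dropped one obviously bad value.
raw = {"2014-06-01T00:00": 8.1, "2014-06-01T03:00": -999.0, "2014-06-01T06:00": 7.9}
cleaned = {"2014-06-01T00:00": 8.1, "2014-06-01T06:00": 7.9}

for timestamp, before, after in change_log(raw, cleaned):
    print(timestamp, before, "->", after)
```

Publishing a log like this alongside the level 1 data would be one way to expose what the cleaning step actually did.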

Data collaboration

They use git to collaborate with people who aren't very technical but can use it well enough. They use pull requests and, right now, Travis CI to verify data integrity.

Data integrity

sensorQC (https://github.com/jread-usgs/sensorQC) is a work-in-progress concept for testing data integrity as data passes through data pipelines. In a YAML file, a user can flexibly define which statistical tests are to be performed to ensure the data is still valid after a change. The tests are run against the data according to this YAML file, and a column is added that is empty if the row passes, or otherwise holds an expression describing all of the ways the row failed the given tests.

Example sensorQC yaml file: https://github.com/jread-usgs/sensorQC/blob/master/inst/extdata/pellerin.yml

Example data input: Takes 30 readings every 3 hours. https://github.com/jread-usgs/sensorQC/blob/master/inst/extdata/test_data.txt

Example data output (with added column of failures): https://github.com/jread-usgs/sensorQC/blob/master/inst/extdata/pellerin_sqc_out.tsv
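The flag column idea can be sketched in a few lines of Python. This is not sensorQC's API, YAML schema, or output format: the `CHECKS` list stands in for the YAML file, and the expressions, thresholds, and readings are hypothetical.

```python
# Hypothetical checks: (expression shown in the flag column, test that is True on failure).
# In sensorQC these would be defined in the YAML file instead.
CHECKS = [
    ("x < 0", lambda x: x < 0),
    ("x > 20", lambda x: x > 20),
    ("x == -999", lambda x: x == -999),
]


def flag_rows(rows, checks=CHECKS):
    """Return (timestamp, value, flags) rows; flags is "" when every check passes."""
    flagged = []
    for timestamp, value in rows:
        # Collect the expression of every check this value fails.
        failed = [expression for expression, is_bad in checks if is_bad(value)]
        flagged.append((timestamp, value, "; ".join(failed)))
    return flagged


# Made-up readings for illustration.
readings = [
    ("2014-06-01T00:00", 8.1),
    ("2014-06-01T03:00", -999.0),
    ("2014-06-01T06:00", 25.3),
]

for row in flag_rows(readings):
    print("\t".join(str(field) for field in row))
```

Keeping every failed expression in the column, rather than only the first, means a reviewer can see all the reasons a row was rejected without re-running the tests.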

Simple plotted diffs can also be useful: do a statistical test, and show flags at the points in the time series that don't pass the test. Most people who do data cleaning are very visual about it, and don't want to lose that.

Related to dat pull requests:

When collaborating, it'd be nice to have something like Travis CI that bakes in something like sensorQC to verify that pull requests preserve the data integrity of the existing columns.
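As a rough sketch of what such a check could look like, the function below compares the table on the base branch with the table proposed in a pull request and reports any existing column that went missing or changed. Everything here is an assumption for illustration: the function name `existing_columns_intact`, the TSV layout, and the column names are hypothetical, and neither dat nor sensorQC is known to provide this.

```python
import csv
import io


def existing_columns_intact(base_tsv, proposed_tsv):
    """Return a list of problems; an empty list means the existing columns survived."""
    base = list(csv.DictReader(io.StringIO(base_tsv), delimiter="\t"))
    proposed = list(csv.DictReader(io.StringIO(proposed_tsv), delimiter="\t"))
    problems = []
    if len(base) != len(proposed):
        problems.append(f"row count changed: {len(base)} -> {len(proposed)}")
    # Rows are compared by position; new columns in the proposal are allowed.
    for number, (old, new) in enumerate(zip(base, proposed), start=1):
        for column, value in old.items():
            if column not in new:
                problems.append(f"row {number}: column {column!r} is missing")
            elif new[column] != value:
                problems.append(f"row {number}: {column} changed {value!r} -> {new[column]!r}")
    return problems


# Made-up tables: the first proposal only adds a flags column, the second edits a reading.
BASE = "time\tdo_mgl\n2014-06-01T00:00\t8.1\n2014-06-01T03:00\t7.9\n"
ADDS_COLUMN = "time\tdo_mgl\tflags\n2014-06-01T00:00\t8.1\t\n2014-06-01T03:00\t7.9\t\n"
EDITS_VALUE = "time\tdo_mgl\n2014-06-01T00:00\t8.1\n2014-06-01T03:00\t9.9\n"

for name, proposal in [("adds column", ADDS_COLUMN), ("edits value", EDITS_VALUE)]:
    problems = existing_columns_intact(BASE, proposal)
    # A CI script would exit with a non-zero status when problems is not empty.
    print(name, "->", "ok" if not problems else problems)
```

Comparing rows by position keeps the sketch short; a real check would more likely match rows on a key such as the timestamp.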

Do you have anything else to add, @jread-usgs?
