NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

Metadata, provenance, and characterizing uncertainty #11

Open mbjones opened 10 years ago

mbjones commented 10 years ago

Organizational Page: MetaProv&Uncrn
Category: Data science
Title: Metadata, provenance, and characterizing uncertainty
Proposed by: Ben Best
Participants:
Summary: Tabular data: easily create metadata, trace provenance, and characterize uncertainty. I shared some links with Matt Jones and Peter Slaughter on this, but am essentially interested in tracking data throughout its derivation steps and quantifying the uncertainty introduced along the way, particularly with Ocean Health Index applications in mind.

bbest commented 10 years ago

I'm interested in lightweight tools and protocols for cascading information alongside tabular CSV files. Here are a few resources for exploration:

For the Ocean Health Index (OHI-science.org), I developed a couple of quick R functions that mine a git repository (with ropensci/git2r), reading the CSVs at every commit and tracing specific values (given by a filter expression) over time. This helps identify when changes occurred and hints at which files in prior commits were responsible.
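To make the idea concrete, here is a minimal sketch along those lines (these are not the original OHI functions, and the `scores.csv` name, its columns, and the filter are hypothetical): use git2r's `odb_blobs()` to list every committed version of a CSV, read each version, and keep the rows matching a filter function, tagged with commit and date.

```r
library(git2r)  # ropensci/git2r

trace_csv_values <- function(repo_path, csv_name, keep_row) {
  repo  <- repository(repo_path)
  # one row per committed version of the file
  blobs <- subset(odb_blobs(repo), name == csv_name)
  out <- lapply(seq_len(nrow(blobs)), function(i) {
    txt <- content(lookup(repo, blobs$sha[i]))     # blob contents as lines
    d   <- read.csv(text = txt, stringsAsFactors = FALSE)
    d   <- d[keep_row(d), , drop = FALSE]          # apply the filter
    if (nrow(d) == 0) return(NULL)
    d$commit <- blobs$commit[i]                    # commit introducing this version
    d$when   <- blobs$when[i]
    d
  })
  do.call(rbind, out)
}

# e.g., follow one region's fisheries score through history (hypothetical columns):
# trace_csv_values("~/github/ohi-global", "scores.csv",
#                  function(d) d$region_id == 163 & d$goal == "FIS")
```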

I'd like to expand on this with smarter, more robust tools. In particular, we should consider sequencing and dependency tools for reproducible science like make, pydoit, and linear-flow, developed by OHI's Darren Hardy. Other resources: Packrat, a dependency-management system for R, and the CRAN Task View: Reproducible Research.
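For flavor, here is the core idea those dependency tools implement, reduced to a few lines of R (a toy sketch, not any of the tools above; the file names and recipe are made up): rebuild a target only when one of its inputs is newer.

```r
# TRUE if the target is missing or any dependency is newer than it
outdated <- function(target, deps) {
  !file.exists(target) || any(file.mtime(deps) > file.mtime(target))
}

build <- function(target, deps, recipe) {
  if (outdated(target, deps)) {
    message("rebuilding ", target)
    recipe()
  } else {
    message(target, " is up to date")
  }
}

# e.g., derive a cleaned CSV only when the raw file has changed:
# build("data/clean.csv", "data/raw.csv", function() {
#   d <- read.csv("data/raw.csv")
#   write.csv(na.omit(d), "data/clean.csv", row.names = FALSE)
# })
```

Tools like make add the rest: a declarative rule language, transitive dependencies, and parallel execution.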

Here are more OHI project notes on capturing uncertainty: ohicore#46, ohiprep:wiki/Whence.

Quite a smorgasbord of links and ideas so far. We need to hone goals, categorize tools, and elaborate on a few use cases to motivate targeted development for CodeFest.

Curious to hear from others about best practices, other tools, and suggestions for strategic improvement.

cboettig commented 10 years ago

@bbest This sounds great to me, particularly anchoring the objective in tracking uncertainty, provenance, and metadata through a specific data set such as you have with the OHI data. In terms of narrowing this down a bit, I would note that several of the ideas here are synergistic with other issues proposed so far.

For instance, given an R-based pipeline of the data processing necessary to generate an OHI country index, issue #14 would then provide a promising way to describe the provenance, while issue #1 might address the challenge of adding metadata to CSVs (perhaps the nascent EML package from rOpenSci can help there as well; see the sketch below). Meanwhile, having an actual pipeline that goes from raw data to some summary final value like the OHI country index would be really cool, and I think a good use case for tackling the thorny issue of uncertainty quantification.
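As an illustration of the EML idea, here is a hedged sketch describing a hypothetical `scores.csv` with the rOpenSci EML package (the column names and definitions are invented, and exactly which fields `set_attributes()` requires varies by attribute class and package version, so treat this as a starting point only):

```r
library(EML)

attrs <- data.frame(
  attributeName       = c("region_id", "goal", "score"),
  attributeDefinition = c("OHI region identifier",
                          "Goal code, e.g. FIS for fisheries",
                          "Index score from 0 to 100"),
  definition          = c("OHI region identifier", "Goal code", NA),  # character columns
  unit                = c(NA, NA, "dimensionless"),                   # numeric columns
  numberType          = c(NA, NA, "real"),
  stringsAsFactors    = FALSE
)

dataTable <- list(
  entityName    = "scores.csv",
  attributeList = set_attributes(attrs,
                    col_classes = c("character", "character", "numeric")),
  physical      = set_physical("scores.csv")  # assumes the file is on disk
)

eml <- list(dataset = list(
  title     = "Ocean Health Index scores (example)",
  creator   = list(individualName = list(surName = "Best")),
  contact   = list(individualName = list(surName = "Best")),
  dataTable = dataTable
))

write_eml(eml, "scores_metadata.xml")   # serialize, then check against the schema
eml_validate("scores_metadata.xml")
```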

I could see two different routes you might take with the uncertainty quantification:

1) Annotate uncertainty: working with just the raw, messy CSV files and the kind of diff tools you describe, generate the provenance / metadata describing the uncertainty (as best it can be estimated), or

2) Synthesize uncertainty: start with a mock-up example that has existing metadata and provenance characterizing the uncertainty, and develop an 'uncertainty-aware' tool that could combine data annotated in this way while propagating the uncertainty appropriately (a toy sketch follows below).

Clearly these two strategies could work together in sequence, but it might help to focus on either the "Annotate" or the "Synthesize" step, one at a time. Anyway, just brainstorming here; don't worry if that's not the right division.
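For route (2), here is a toy sketch of the propagation step (everything is hypothetical: the value/sd column pairing, the weights, and the independence assumption): each annotated table carries a value and a standard deviation, and combining two tables by a weighted sum propagates the variances accordingly.

```r
# Combine two uncertainty-annotated tables by a weighted sum,
# assuming independent errors in the two inputs.
combine_indicators <- function(a, b, w = c(0.5, 0.5)) {
  stopifnot(all(c("value", "sd") %in% names(a)),
            all(c("value", "sd") %in% names(b)))
  data.frame(
    value = w[1] * a$value + w[2] * b$value,
    # variance of a weighted sum of independent variables
    sd    = sqrt((w[1] * a$sd)^2 + (w[2] * b$sd)^2)
  )
}

# hypothetical per-region goal scores with standard deviations
fis <- data.frame(value = c(71, 64), sd = c(3.0, 5.5))
mar <- data.frame(value = c(80, 58), sd = c(2.0, 4.0))
combine_indicators(fis, mar)
```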

aashish24 commented 10 years ago

I'd be interested in this session.