AU-BURGr / UnConf2017

Repository for Unconf Topics 2017

Tools for data/results provenance #5

Open MilesMcBain opened 7 years ago

MilesMcBain commented 7 years ago

I recently had the pleasure of using R as part of a team in a data science project. Despite the best reproducibility intentions, we ended up getting ourselves in a mighty tangle with dataset versions, modelling result versions, and matching up the two.

It got me thinking about the issue of provenance and the tooling in R. I'd be keen to work on any of the following:

A much more long-winded proposal that motivates all of these is available here: https://github.com/MilesMcBain/journalr/blob/master/Journalling_tool_proposal.Rmd

jonocarroll commented 7 years ago

A possible component could be last year's suggested project of an 'R package to store/access metadata associated with data/functions': https://github.com/ropensci/auunconf/issues/18

MilesMcBain commented 7 years ago

Wow Jono, you are right, this is a very similar idea to that one!

I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (validate_checksum(data)), print out the context (context(data)$owner; context(data)$last_modified), etc. Ditto for functions that do what one thinks they do (context(my_function)$assumptions). The context travels with the data/function and can be tested against, e.g.

Yes. I would be happy to try to hack this up.
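
For what it's worth, here is a minimal base R sketch of how the brainstormed interface above might behave. It is not code from either proposal: `set_context()` and `object_checksum()` are hypothetical helper names, the bodies of `context()` and `validate_checksum()` are just one guess at an implementation, and the example data, owner and date are made up for illustration. It assumes the context is stored as an ordinary attribute and the checksum is an MD5 of the serialized object.

```r
# Hypothetical helpers: none of these come from an existing package.

# MD5 of an object's serialized bytes, ignoring any attached context
# so that editing the metadata does not change the data checksum.
object_checksum <- function(x) {
  attr(x, "context") <- NULL
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(x, NULL), tmp)
  unname(tools::md5sum(tmp))
}

# Attach arbitrary context fields plus a checksum of the current state.
set_context <- function(x, ...) {
  attr(x, "context") <- c(list(...), list(checksum = object_checksum(x)))
  x
}

# Read the context back.
context <- function(x) attr(x, "context")

# TRUE only if the object still matches the checksum stored in its context.
validate_checksum <- function(x) {
  identical(object_checksum(x), context(x)$checksum)
}

# Illustrative data; owner and date are invented for the example.
dat <- set_context(
  data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
  owner = "data team",
  last_modified = as.Date("2017-10-01")
)

context(dat)$owner
validate_checksum(dat)

# A modified copy no longer matches the stored checksum.
changed <- dat
changed$value[2] <- 0
validate_checksum(changed)

# The same mechanism applied to a function and its assumptions.
my_function <- set_context(
  function(x) x / sum(x),
  assumptions = "x is numeric with a non-zero sum"
)
context(my_function)$assumptions
```

One caveat with this attribute-based design: many R operations (subsetting a vector, for example) drop custom attributes, so the context would not always travel with the data unless it were backed by a class with its own methods or stored alongside the object.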