Open cmungall opened 7 years ago
It shouldn't be that hard to record metadata: just look for where the MD5 hash is updated and stick another hook in there.
James Taylor suggests using PROV: https://twitter.com/jxtx/status/916406694674132992
I am thinking of making a start on this very soon, using PROV-O as the vocabulary. It is also used by projects like wf4ever.
The basic model has three classes: Entity, Agent and Activity.
I think the primary agent would be biomake itself, with an acted-on-behalf-of edge to the person executing the workflow. The entity would be the file, and the activity would be the makefile recipe/rule.
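To make the mapping above concrete, here is a minimal sketch in Python (biomake itself is Prolog, so this is illustration only). The PROV-O terms are real, but the `http://example.org/biomake/` namespace, the function names and the argument layout are all assumptions and not anything biomake currently defines.

```python
from urllib.parse import quote

# Placeholder namespace: an assumption, not something biomake defines.
BASE = "http://example.org/biomake/"


def iri(kind, name):
    """Build a bracketed IRI under the placeholder namespace."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def rule_to_turtle(rule_id, target, deps, user):
    """Describe one rule execution as PROV-O statements in Turtle.

    rule_id, target, deps and user are hypothetical inputs standing in
    for whatever biomake already knows when it runs a rule.
    """
    biomake = iri("agent", "biomake")
    person = iri("agent", user)
    activity = iri("activity", rule_id)
    out = iri("file", target)

    lines = ["@prefix prov: <http://www.w3.org/ns/prov#> .", ""]
    # Agents: the tool, acting on behalf of the person who ran it.
    lines.append(f"{biomake} a prov:SoftwareAgent ;")
    lines.append(f"    prov:actedOnBehalfOf {person} .")
    lines.append(f"{person} a prov:Person .")
    # Activity: one execution of a makefile rule.
    lines.append(f"{activity} a prov:Activity ;")
    lines.append(f"    prov:wasAssociatedWith {biomake} .")
    # Entities: the dependency files consumed and the target produced.
    for dep in deps:
        d = iri("file", dep)
        lines.append(f"{d} a prov:Entity .")
        lines.append(f"{activity} prov:used {d} .")
        lines.append(f"{out} prov:wasDerivedFrom {d} .")
    lines.append(f"{out} a prov:Entity ;")
    lines.append(f"    prov:wasGeneratedBy {activity} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(rule_to_turtle("rule1", "out.txt", ["in.txt"], "alice"))
```

Using full IRIs rather than prefixed names sidesteps Turtle's restrictions on characters such as `/` in local names, which file paths will almost always contain.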
The primary output would be RDF/Turtle, but we could also have JSON (as well as a native Prolog representation). Having some kind of dot/Graphviz export should also be simple.
@cmungall I like this, especially how clean the mapping to PROV-O is: I think most/all of those things in the diagram are already being calculated at some point in biomake.
Reproducibility and provenance are increasingly important.
Makefiles and Makefile-like solutions such as biomake help with reproducibility; if the recipe and input files are provided in a GitHub repo, then in theory it is easy to re-execute the workflow and hopefully get the same answer.
However, if the final output files are submitted to a data repository, the provenance may not be immediately obvious. Initiatives such as BD2K are emphasizing the importance of metadata on all digital objects, which includes analysis results. Of course it is possible to manually annotate these artefacts, but why do that when it can be automated?
For any file derived by biomake, it should be possible to immediately see a graph of the objects used to derive it, together with complete metadata on each; this includes standard filesystem metadata (e.g. timestamp) but additional metadata too. See also https://github.com/W3C-HCLSIG/HCLSDatasetDescriptions
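For the per-file metadata, a minimal sketch of what could be collected for each node in the graph is shown below, in Python for illustration. The field names are assumptions; a real implementation would presumably map them onto terms from a vocabulary such as the HCLS dataset descriptions linked above.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_metadata(path):
    """Collect basic filesystem metadata plus an MD5 checksum for one file.

    The dictionary keys are hypothetical field names, not an agreed schema.
    """
    st = os.stat(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "modified": modified.isoformat(),
        "md5": md5.hexdigest(),
    }
```

Calling this on every target and dependency would give one record per file that could be attached to the corresponding node in the provenance graph.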
This may be a heavyweight feature so may be best implemented as some kind of plugin.