evoldoers / biomake

GNU-Make-like utility for managing builds and complex workflows
BSD 3-Clause "New" or "Revised" License
102 stars 9 forks

plugin idea: automatic metadata annotation #15

Open cmungall opened 7 years ago

cmungall commented 7 years ago

Reproducibility and provenance are increasingly important.

Makefiles and Makefile-like solutions such as biomake help with reproducibility; if the recipe and input files are provided in a GitHub repo then in theory it is easy to re-execute the workflow and hopefully get the same answer.

However, if the final output files are submitted to a data repository, the provenance may not be immediately obvious. Initiatives such as BD2K are emphasizing the importance of metadata on all digital objects, which includes analysis results. Of course it is possible to manually annotate these artefacts, but why do that when this can be automated?

For any file derived by biomake, it should be possible to immediately see a graph of the objects used to derive it, together with complete metadata on each; this includes standard filesystem metadata (e.g. timestamp) as well as additional metadata. See also https://github.com/W3C-HCLSIG/HCLSDatasetDescriptions

This may be a heavyweight feature so may be best implemented as some kind of plugin.

ihh commented 7 years ago

It shouldn't be that hard to record metadata: just look for where the MD5 hash is updated and stick another hook in there.
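As a rough illustration of what such a hook might record, here is a minimal Python sketch (biomake itself is not written in Python, and the function name `file_metadata` and the choice of fields are assumptions made for this example, not anything in biomake). It computes the MD5 digest and gathers basic filesystem metadata in the same pass.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_metadata(path):
    """Return an MD5 digest plus basic filesystem metadata for one file.

    Hypothetical helper: the field names are illustrative only.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "path": path,
        "md5": md5.hexdigest(),
        "size": info.st_size,
        "modified": modified.isoformat(),
    }
```

A record like this could be written out wherever the hash is stored, so the metadata stays in step with the checksum.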

cmungall commented 6 years ago

James Taylor suggests using PROV: https://twitter.com/jxtx/status/916406694674132992

cmungall commented 6 years ago

I am thinking of making a start on this very soon, using PROV-O as the vocabulary. This is also used by projects like wf4ever.

The basic model has three classes: entity, agent and activity. [image: diagram of the basic model]

I think the primary agent would be biomake itself, with an acted-on-behalf-of edge to the person executing the workflow. The entity would be the file, and the activity would be the makefile recipe/rule.
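The mapping described above could be sketched roughly as follows. This is a minimal Python illustration rather than biomake code: the `http://example.org/` namespace, the IRI layout and the function names are all hypothetical, and the Turtle is built with plain string formatting instead of an RDF library. The `prov:` terms themselves come from the PROV-O vocabulary.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def iri(kind, name):
    """Build an absolute IRI in Turtle angle-bracket syntax."""
    return f"<{BASE}{kind}/{quote(name)}>"


def prov_turtle(target, dependencies, rule, user):
    """Describe one rule execution as PROV-O triples in Turtle."""
    biomake = iri("agent", "biomake")
    person = iri("agent", user)
    activity = iri("activity", rule)
    out = iri("file", target)
    lines = [
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "",
        f"{person} a prov:Person .",
        # biomake is the primary agent, acting on behalf of the user.
        f"{biomake} a prov:SoftwareAgent ;",
        f"    prov:actedOnBehalfOf {person} .",
        # The recipe/rule is the activity.
        f"{activity} a prov:Activity ;",
        f"    prov:wasAssociatedWith {biomake} .",
        # The target file is an entity generated by that activity.
        f"{out} a prov:Entity ;",
        f"    prov:wasGeneratedBy {activity} .",
    ]
    for dep in dependencies:
        dep_iri = iri("file", dep)
        lines.append(f"{dep_iri} a prov:Entity .")
        lines.append(f"{activity} prov:used {dep_iri} .")
        lines.append(f"{out} prov:wasDerivedFrom {dep_iri} .")
    return "\n".join(lines) + "\n"


print(prov_turtle("align.bam", ["reads.fq", "ref.fa"], "align", "alice"))
```

Each dependency becomes an entity that the activity used, which gives the derivation graph for the target directly.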

The primary output would be RDF/Turtle, but we could also have JSON (as well as a native Prolog representation). Having some kind of dot/Graphviz export should also be simple.
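To give a sense of how small a Graphviz export could be, here is a hedged Python sketch. It assumes the derivation graph is already available as a mapping from each target to its dependencies; that data structure and the name `to_dot` are inventions for this example.

```python
def to_dot(derivations):
    """Render {target: [dependencies]} as a Graphviz DOT digraph."""

    def quoted(name):
        # DOT string literals need backslashes and double quotes escaped.
        return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph provenance {", "    rankdir=LR;"]
    for target in sorted(derivations):
        for dep in derivations[target]:
            # Edges point from an input file to the file derived from it.
            lines.append(f"    {quoted(dep)} -> {quoted(target)};")
    lines.append("}")
    return "\n".join(lines)


print(to_dot({"align.bam": ["reads.fq", "ref.fa"], "calls.vcf": ["align.bam"]}))
```

The output can be passed to the `dot` command to draw the graph.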

cmungall commented 6 years ago

Another possibility here is allowing the user to easily generate a bagit or bagit-ro for their folder once the workflow is executed.
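For reference, a bare-bones bag is just a directory with a `bagit.txt` declaration, a `data/` payload directory and a checksum manifest. The sketch below is a simplified Python illustration under stated assumptions: `make_minimal_bag` is a hypothetical name, files are copied flat into `data/` (so same-named files from different directories would collide), and none of the bagit-ro research object metadata is produced.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(files, bag_dir):
    """Copy files into a minimal BagIt-style layout with a SHA-256 manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest = []
    for src in map(Path, files):
        dest = data / src.name
        shutil.copy2(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        # Manifest lines pair a checksum with a path relative to the bag root.
        manifest.append(f"{digest}  data/{src.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest) + "\n", encoding="utf-8"
    )
    return bag
```

In a workflow setting, the list of files would presumably be the targets and dependencies that were built or used during the run.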

ihh commented 6 years ago

@cmungall I like this, especially how clean the mapping to PROV-O is: I think most/all of those things in the diagram are already being calculated at some point in biomake.