Open mr-c opened 6 years ago
@psafont Can you update the 1st comment above with your status and any additional work you see that is needed?
Making these changes involves quite a bit of friction because CWLProv is part of the cwltool package. I don't know to what extent it would be beneficial to separate it into a different module.
There is not much separation of concerns in some functions: they use `provenance.py`'s functions directly. I think this is linked with some of the tight coupling we've already solved. The question is how far we want to go. (I've only spent about an hour going into @inutano's provenance work)
- [x] Refactor `CWLJob.run()` to return `(outputs, metadata)` instead of just `outputs`. `metadata` is a dictionary that will contain the information we need for generating CWLProv.
- [x] Propagate the metadata through the `.run()` calls to the root of the computation.
- [ ] Try to reuse Toil's Jobstore IDs (see https://github.com/DataBiosphere/toil/issues/2449): for each `CWLJob`, record this ID and the parent ID.
- [ ] Fill `metadata` with a data structure containing runtime information about the tasks (tree or dict, with the keys being the jobstore IDs).
- [ ] Generate a `ProvenanceProfile` per task and a `ResearchObject` when all the metadata has been gathered.
- [x] Refactor `cwltool/provenance.py` so that recorded time and time of recording are decoupled.
- [ ] Refactor `ProvenanceProfile:prospective_prov` out of the class to be the function that creates all the `ProvenanceProfile`s and relates them in a tree-like structure.
- [ ] Refactor `cwltool/provenance.py` so that we can defer file movements until the end of the run.
- [x] Update Toil to use cwltool with the fixes (https://github.com/DataBiosphere/toil/pull/2469).
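To make the first few items concrete, here is a rough sketch of the `(outputs, metadata)` idea. It is not the actual Toil or cwltool code: `SketchJob`, its `job_id` strings, and the record keys (`parent`, `started`, `ended`) are all hypothetical, and the children are called directly, whereas Toil schedules child jobs itself. It only shows how a dict keyed by job ID can bubble up to the root while keeping the recorded times separate from whenever the provenance gets written.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical shape: job ID -> runtime record for that job.
Metadata = Dict[str, Dict[str, Any]]


@dataclass
class SketchJob:
    """Stand-in for a CWLJob; NOT Toil's real class."""

    job_id: str  # assumed to play the role of a jobstore ID
    children: List["SketchJob"] = field(default_factory=list)

    def run(self, parent_id: Optional[str] = None) -> Tuple[Dict[str, Any], Metadata]:
        # Capture the time now; provenance can be written much later.
        started = datetime.now(timezone.utc)
        outputs: Dict[str, Any] = {}
        metadata: Metadata = {}

        # Collect outputs and metadata from every child and merge them upward.
        for child in self.children:
            child_outputs, child_metadata = child.run(parent_id=self.job_id)
            outputs.update(child_outputs)
            metadata.update(child_metadata)

        outputs[self.job_id] = f"result-of-{self.job_id}"
        metadata[self.job_id] = {
            "parent": parent_id,  # lets us rebuild the job tree afterwards
            "started": started,
            "ended": datetime.now(timezone.utc),
        }
        return outputs, metadata


root = SketchJob("root", [SketchJob("a"), SketchJob("b", [SketchJob("c")])])
outputs, metadata = root.run()
```

At the root, `metadata` holds one record per job, and the `parent` entries are enough to relate per-task profiles in a tree once everything has been gathered.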
Most of the progress is found on https://github.com/DataBiosphere/toil/tree/wip-prov
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-280