benscott closed 11 months ago
From discussions with @PaulBrack
How do we ensure a workflow is reproducible?
Would need:
Dockerised tools will make this easier: we can track the identifiers of the Docker containers, and these can be stored in some kind of registry.
In terms of workflows and their arguments, we could provide the workflow run itself OR design the SDR so it keeps a permanent record of all the workflows it has run. This would require a long-term silo...
Provenance approaches:
https://cloudevents.io/ http://demo.nsidr.org/specimens/f22fec45_7b47_11e4_8ef3_782bcb9cd5b5
We should include workflow/component attribution in the MVP and we need a separate provenance object.
This should be done in the manner planned for open Digital Specimens, using the PROV agent-entity-activity model (triples) embedded in CloudEvents. See: https://github.com/DiSSCo/openDS/blob/master/data-model/data-model-intro.md
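A minimal sketch of what "PROV triples embedded in a cloud event" could look like, using only the Python standard library. The envelope attributes (`specversion`, `id`, `source`, `type`, `time`, `datacontenttype`, `subject`, `data`) come from the CloudEvents 1.0 spec and the predicates from W3C PROV; the function name, the event `type` string, the identifiers and the `{"triples": [...]}` payload layout are assumptions for illustration, not the openDS design.

```python
import json
import uuid
from datetime import datetime, timezone


def build_provenance_event(specimen_id, tool_id, activity_id, source_uri):
    """Wrap PROV-style (subject, predicate, object) triples in a CloudEvents 1.0 envelope.

    All identifiers are passed in by the caller; nothing here is taken from openDS.
    """
    triples = [
        (activity_id, "rdf:type", "prov:Activity"),
        (tool_id, "rdf:type", "prov:Agent"),
        (specimen_id, "rdf:type", "prov:Entity"),
        (activity_id, "prov:wasAssociatedWith", tool_id),
        (specimen_id, "prov:wasGeneratedBy", activity_id),
        (specimen_id, "prov:wasAttributedTo", tool_id),
    ]
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source_uri,
        "type": "org.example.sdr.provenance",  # hypothetical event type
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "subject": specimen_id,
        "data": {"triples": [list(t) for t in triples]},  # hypothetical payload layout
    }


if __name__ == "__main__":
    event = build_provenance_event(
        specimen_id="specimen:example-1",    # hypothetical identifiers
        tool_id="tool:example-segmenter",
        activity_id="run:example-42",
        source_uri="https://example.org/sdr",
    )
    print(json.dumps(event, indent=2))
```

Keeping the triples in `data` means the envelope stays a plain CloudEvent, so any consumer can route on `type` and `subject` without understanding PROV.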
Can this be extracted in the workflow history in Galaxy? Can you do it through the UI and/or the API? In the API you can request the entire history of the workflow upon completion.
Need to track changes made to the openDS object by each tool.
Need to confirm: is the history object sufficient? If so, at the end of the workflow pull it in and convert it to PROV - https://github.com/albangaignard/galaxy-PROV
If not, for each tool add a diff in the post-completion hook, using this as a data model: https://github.com/DiSSCo/openDS/blob/master/data-model/data-model-intro.md
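A rough sketch of the per-tool diff idea, assuming the openDS object is available as a JSON-like dict before and after each tool runs. The function names and the change-record shape (loosely modelled on JSON Patch `add`/`remove`/`replace` operations, without pointer escaping) are assumptions; how this would be called from a Galaxy post-completion hook is not shown.

```python
def diff_objects(before, after, path=""):
    """Return a list of change records between two JSON-like dicts.

    Nested dicts are compared recursively; lists and scalars are treated
    as atomic values and reported as a single 'replace'.
    """
    changes = []
    for key in sorted(set(before) | set(after)):
        key_path = f"{path}/{key}"
        if key not in before:
            changes.append({"op": "add", "path": key_path, "value": after[key]})
        elif key not in after:
            changes.append({"op": "remove", "path": key_path})
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            changes.extend(diff_objects(before[key], after[key], key_path))
        elif before[key] != after[key]:
            changes.append({"op": "replace", "path": key_path, "value": after[key]})
    return changes


def record_tool_changes(tool_name, before, after):
    """Hypothetical helper: bundle the diff with the name of the tool that caused it."""
    return {"tool": tool_name, "changes": diff_objects(before, after)}


if __name__ == "__main__":
    before = {"id": "s1", "images": {"count": 1}}
    after = {"id": "s1", "images": {"count": 2}, "labels": ["a"]}
    print(record_tool_changes("example-tool", before, after))
```

Storing one such record per tool gives a change log that could later be mapped onto provenance activities, whichever data model is chosen.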
Was hoping this project could be resurrected but haven't heard anything from Ignacio Eguinoa. Have asked Stian if he knows anything about it. https://github.com/ieguinoa/galaxy-provenance-capture
Currently looking through Galaxy documentation to see how this is done.
I have engagement with the Galaxy team on this - need to arrange a meeting with @benscott, @llivermore, Ignacio and Frederik to determine timelines on this
Removed POC milestone as this will require further work past the milestone
Moved onto current milestone as am meeting to discuss tomorrow
Meeting today with Ignacio Eguinoa and Paul @ VIB-UGent
Need to schedule meeting with:
Paul Brack, Frederik Coppens (if possible), Bjorn Gruning, Ignacio Eguinoa, David Lopez, Laurence Livermore, Stian Soiland-Reyes, Paul @ Ghent padge@psb.ugent.be
Need to document test cases before the meeting - perhaps write a short presentation
Ignacio thinks there is enough bandwidth to get this working within a few months
Ignacio has suggested this as a biohackathon topic
@stain can you get an update on where this is at from the core Galaxy team and whether there is additional research/input required from me or others?
Galaxy folks say this is still planned for the 2022.09 release. A nice unified UI is also coming along for both BioCompute Object and RO-Crate export. See
https://github.com/galaxyproject/galaxy/pull/14606 https://github.com/galaxyproject/galaxy/pull/14639
Moved to backlog and awaiting release from core Galaxy team - 2022.09 not released yet.
https://github.com/galaxyproject/galaxy/pull/14606 has been merged and will give Galaxy an RO-Crate export (Workflow RO-Crate), but without provenance. Next Galaxy release will be early 2023.
Provenance support is in https://github.com/ResearchObject/workflow-run-crate/pull/30 which we will work on in ELIXIR Biohackathon next week to move it to the right repo.
Released in Feb 2023 https://galaxyproject.org/news/2023-02-23-structured-data-exports-ro-bco/
Each component will update the specimen object's provenance data with the actions taken on the specimen object, and task metadata including:
This should be abstracted, so we have a common and reusable function.
As with #10 can this be a wrapper/Galaxy preprocessing step around the pipeline component?
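One way the common, reusable function and the wrapper idea could fit together is a decorator around each pipeline component. This is only a sketch: the decorator name, the `provenance` key and the recorded fields (`tool`, `version`, `started`, `ended`) are placeholders, since the actual task metadata list is not specified here, and a Galaxy preprocessing step may look quite different.

```python
import copy
import functools
from datetime import datetime, timezone


def with_provenance(tool_name, tool_version):
    """Decorator: run a component on a copy of the specimen and append a provenance record."""

    def decorator(component):
        @functools.wraps(component)
        def wrapper(specimen, *args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            # Work on a deep copy so the caller's object is left untouched.
            result = component(copy.deepcopy(specimen), *args, **kwargs)
            ended = datetime.now(timezone.utc).isoformat()
            # Placeholder fields; the real task metadata is still to be defined.
            result.setdefault("provenance", []).append(
                {
                    "tool": tool_name,
                    "version": tool_version,
                    "started": started,
                    "ended": ended,
                }
            )
            return result

        return wrapper

    return decorator


@with_provenance("example-segmenter", "0.1")  # hypothetical component
def segment(specimen):
    specimen["segmented"] = True
    return specimen


if __name__ == "__main__":
    print(segment({"id": "s1"}))
```

Because the bookkeeping lives in the decorator, individual components only need to accept and return the specimen object.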