DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Determine provenance approach #11

Closed: benscott closed 11 months ago

benscott commented 3 years ago

Each component will update the specimen object's provenance data with the actions taken on the specimen object, and with task metadata including:

This should be abstracted, so we have a common and reusable function.

As with #10, can this be a wrapper or Galaxy preprocessing step around the pipeline component?
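A minimal sketch of what such a common, reusable wrapper might look like, assuming specimen objects are plain dictionaries. The decorator name `with_provenance`, the `provenance` key, and the entry fields are all hypothetical and are not taken from openDS or Galaxy.

```python
import copy
import functools
from datetime import datetime, timezone


def with_provenance(component_name, version):
    """Wrap a pipeline component so every call appends a provenance entry.

    All field names below are illustrative assumptions, not an agreed schema.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(specimen, **params):
            started = datetime.now(timezone.utc).isoformat()
            # Work on a copy so the caller's object is left untouched.
            result = func(copy.deepcopy(specimen), **params)
            entry = {
                "component": component_name,
                "version": version,
                "parameters": params,
                "started": started,
                "ended": datetime.now(timezone.utc).isoformat(),
            }
            # Carry the existing provenance forward, then add this step.
            result["provenance"] = list(specimen.get("provenance", [])) + [entry]
            return result
        return wrapper
    return decorator


@with_provenance("example-ocr", "0.1.0")
def add_label_text(specimen, text=""):
    """Hypothetical component: attach transcribed label text."""
    specimen["label_text"] = text
    return specimen
```

Each component stays unaware of provenance; the wrapper is the only place that writes it, which is the property a Galaxy-level preprocessing step would also need.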

llivermore commented 3 years ago

From discussions with @PaulBrack

How do we ensure a workflow is reproducible?

Would need:

Dockerised tools will make it easier. We can track the identifiers of the Docker containers, and these can be stored in some kind of registry.

For workflows and their arguments, we could provide the workflow run itself, or design the SDR so that it keeps a permanent record of all the workflows it has run. This would require a long-term silo...
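As a rough illustration of the "permanent record" option, the sketch below stores each run as one JSON line containing the container image digest and arguments for every step. The class names, field names, and the append-only log file are assumptions made for this example only.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class ToolStep:
    name: str
    image_digest: str  # identifier of the exact container image, e.g. "sha256:..."
    arguments: dict = field(default_factory=dict)


@dataclass(frozen=True)
class WorkflowRun:
    workflow_id: str
    steps: tuple  # tuple of ToolStep, in execution order

    def to_json(self):
        # sort_keys gives a stable serialisation for the same run description.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self):
        """Hash of the run description; equal runs give equal fingerprints."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


def append_run(log_path, run):
    """Append one run to a JSON-lines log (a stand-in for the long-term silo)."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(run.to_json() + "\n")
```

Two runs with the same workflow, image digests, and arguments produce the same fingerprint, which gives a cheap check that a re-run used the same inputs to the pipeline definition.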

benscott commented 3 years ago

Provenance approaches:

https://cloudevents.io/
http://demo.nsidr.org/specimens/f22fec45_7b47_11e4_8ef3_782bcb9cd5b5

llivermore commented 3 years ago

We should include workflow/component attribution in the MVP and we need a separate provenance object.

This should be done in the manner planned for open Digital Specimens, using the PROV agent-entity-activity model (triples) embedded in CloudEvents. See: https://github.com/DiSSCo/openDS/blob/master/data-model/data-model-intro.md
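A hedged sketch of what a PROV payload inside a CloudEvents envelope could look like. The four required CloudEvents 1.0 attributes (`specversion`, `id`, `source`, `type`) and the PROV-O property names are real; the `source` and `type` values, the shape of `data`, and the function name are assumptions and do not come from the openDS data model.

```python
import json
import uuid
from datetime import datetime, timezone


def provenance_event(specimen_id, tool_name, activity_id=None):
    """Build a CloudEvents-style dict carrying PROV triples for one tool run."""
    activity_id = activity_id or f"urn:uuid:{uuid.uuid4()}"
    return {
        # Required CloudEvents 1.0 context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": f"urn:example:sdr:tool:{tool_name}",  # hypothetical value
        "type": "org.example.sdr.specimen.updated",     # hypothetical value
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "subject": specimen_id,
        "datacontenttype": "application/json",
        # Payload layout is an assumption: entity, activity, agent plus triples.
        "data": {
            "entity": specimen_id,
            "activity": activity_id,
            "agent": tool_name,
            "triples": [
                [specimen_id, "prov:wasGeneratedBy", activity_id],
                [activity_id, "prov:wasAssociatedWith", tool_name],
                [specimen_id, "prov:wasAttributedTo", tool_name],
            ],
        },
    }
```

For example, `json.dumps(provenance_event("spec-1", "ocr"))` yields a JSON document that could be stored as the separate provenance object alongside the specimen.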

Can this be extracted from the workflow history in Galaxy? Can you do it through the UI and/or the API? In the API you can request the entire history of the workflow upon completion.

benscott commented 3 years ago

Need to track changes made to the openDS object by each tool.

Need to confirm: is the history object sufficient? If so, at the end of the workflow, pull it in and convert it to PROV - https://github.com/albangaignard/galaxy-PROV

If not, add a diff for each tool in the post-completion hook, using this as the data model: https://github.com/DiSSCo/openDS/blob/master/data-model/data-model-intro.md
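A minimal sketch of the per-tool diff idea, assuming the openDS object can be treated as nested dictionaries with string keys. The operation names are loosely borrowed from JSON Patch, but this is not a compliant JSON Patch implementation, and how it would be called from a Galaxy post-completion hook is left open.

```python
def diff_objects(before, after, path=""):
    """Return a list of add/remove/replace changes between two nested dicts."""
    changes = []
    for key in sorted(set(before) | set(after)):
        here = f"{path}/{key}"
        if key not in before:
            changes.append({"op": "add", "path": here, "value": after[key]})
        elif key not in after:
            changes.append({"op": "remove", "path": here})
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Recurse so that only the changed leaf is reported.
            changes.extend(diff_objects(before[key], after[key], here))
        elif before[key] != after[key]:
            changes.append({"op": "replace", "path": here, "value": after[key]})
    return changes
```

Calling `diff_objects(obj_before_tool, obj_after_tool)` after each tool would give a compact record of exactly what that tool changed; lists are compared as whole values in this sketch.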

PaulBrack commented 2 years ago

Was hoping this project could be resurrected but haven't heard anything from Ignacio Eguinoa. Have asked Stian if he knows anything about it. https://github.com/ieguinoa/galaxy-provenance-capture

Currently looking through Galaxy documentation to see how this is done.

PaulBrack commented 2 years ago

I have engagement with the Galaxy team on this - need to arrange a meeting with @benscott, @llivermore, Ignacio and Frederik to determine timelines on this

PaulBrack commented 2 years ago

Removed POC milestone as this will require further work past the milestone

PaulBrack commented 2 years ago

Moved onto current milestone as am meeting to discuss tomorrow

PaulBrack commented 2 years ago

Meeting today with Ignacio Eguinoa and Paul @ VIB-UGent

Need to schedule meeting with:

Paul Brack
Frederik Coppens (if possible)
Bjorn Gruning
Ignacio Eguinoa
David Lopez
Laurence Livermore
Stian Soiland-Reyes
Paul @ Ghent padge@psb.ugent.be

Need to document test cases before the meeting - perhaps write a short presentation

Ignacio thinks there is enough bandwidth to get this working within a few months

PaulBrack commented 2 years ago

Ignacio has suggested this as a biohackathon topic

llivermore commented 2 years ago

@stain can you get an update on where this is at from the core Galaxy team and whether there is additional research/input required from me or others?

stain commented 1 year ago

Galaxy folks say this is still planned for the 2022.09 release. A nice unified UI is also coming along for both BioCompute Object and RO-Crate export. See

https://github.com/galaxyproject/galaxy/pull/14606
https://github.com/galaxyproject/galaxy/pull/14639

llivermore commented 1 year ago

Moved to backlog and awaiting release from core Galaxy team - 2022.09 not released yet.

stain commented 1 year ago

https://github.com/galaxyproject/galaxy/pull/14606 has been merged and will give Galaxy an export of Workflow RO-Crate, but without provenance. Next Galaxy release will be early 2023.

Provenance support is in https://github.com/ResearchObject/workflow-run-crate/pull/30, which we will work on at the ELIXIR Biohackathon next week to move it to the right repo.

stain commented 11 months ago

Released in Feb 2023 https://galaxyproject.org/news/2023-02-23-structured-data-exports-ro-bco/
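For anyone consuming these exports, here is a small sketch of reading the root entity from an RO-Crate packaged as a zip. It assumes the archive has `ro-crate-metadata.json` at its top level and follows the RO-Crate 1.1 layout (a metadata descriptor whose `about` points at the root dataset); the function name is hypothetical and this has not been checked against an actual Galaxy export.

```python
import io
import json
import zipfile


def read_crate_root(crate_zip):
    """Return (root_entity, all_entities_by_id) from a zipped RO-Crate.

    `crate_zip` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(crate_zip) as zf:
        with zf.open("ro-crate-metadata.json") as fh:
            metadata = json.load(fh)
    # Index the flattened JSON-LD graph by @id for easy lookup.
    entities = {entity["@id"]: entity for entity in metadata["@graph"]}
    descriptor = entities["ro-crate-metadata.json"]
    root = entities[descriptor["about"]["@id"]]
    return root, entities
```

In a Workflow RO-Crate the root dataset's `mainEntity` references the workflow file, so `entities[root["mainEntity"]["@id"]]` would be the natural starting point for attaching SDR provenance.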