DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Document challenges found using incremental FDOs in workflows #59

Closed PaulBrack closed 1 year ago

PaulBrack commented 2 years ago

From #47 :

Another point for discussion: at the moment, every tool takes openDS as input, and code.py maps it to opends_properties. The drawback is that every input variable needs to be munged into the openDS structure. Take GEORG, for example: its input is a locality text string. To run it as a standalone tool on a spreadsheet of data, the spreadsheet would need to contain (or be converted to) openDS objects.

Would it be better for tools to accept the inputs they actually require? Instead of each tool converting the openDS into its inputs, we would have an openDS mapper tool, which would take the openDS input, define the expected outputs to be plucked from the openDS JSON, and feed these into the subsequent tool.

So rather than openDS => tool we would have openDS => openDS mapper => tool.

Advantages: this is more in line with the way Galaxy is designed to be used. It also removes a lot of code redundancy: at present each tool needs code.py to extract the openDS properties, whereas there would be one tool that did this.
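To make the openDS => openDS mapper => tool idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `collecting_event.locality` path is not taken from the openDS schema, `LOCALITY_TOOL_INPUTS`, `pluck` and `map_opends` are hypothetical names, and `locality_tool` is a stand-in for a tool that only needs a locality string, not GEORG itself.

```python
from typing import Any, Optional

# Hypothetical mapping: tool input name -> dotted path inside the openDS JSON.
# The path below is illustrative only, not taken from the openDS schema.
LOCALITY_TOOL_INPUTS = {"locality": "collecting_event.locality"}


def pluck(document: dict, path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current: Any = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def map_opends(document: dict, mapping: dict) -> dict:
    """The 'openDS mapper' step: pluck only the inputs a tool declares it needs."""
    return {name: pluck(document, path) for name, path in mapping.items()}


def locality_tool(locality: str) -> dict:
    """Stand-in for a tool that takes a plain locality string and knows nothing of openDS."""
    return {"query": locality.strip()}


if __name__ == "__main__":
    opends = {"collecting_event": {"locality": " Kew Gardens, London "}}
    inputs = map_opends(opends, LOCALITY_TOOL_INPUTS)
    print(locality_tool(**inputs))
```

With this split, the same `locality_tool` could be called directly on a spreadsheet column, and only the mapper would need to know where values live in the openDS JSON.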

PaulBrack commented 2 years ago

Decided to leave this for now and reconsider in the next milestone.

Cubey0 commented 2 years ago

OK - works for me. Thanks for your response.

PaulBrack commented 2 years ago

Writing proposed solution today

stain commented 2 years ago

We agreed to delay this till after the May 2022 deliverable.

Perhaps have two inputs/outputs: openDS (JSON only) and data (a dataset of files, possibly an RO-Crate). This would make data flow within Galaxy explicit rather than implicit for files not yet published with DiSSCo (currently stored in a temporary directory within Galaxy).

If we move the common openDS processing to a pip-installable module, it could be used by a Galaxy wrapper per tool, and the tool itself (where appropriate) can be less openDS-aware. In effect there could be two wrappers: one passing openDS+data, and another handling the files natively, which is better for testing and for use in other workflows.
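A rough sketch of how the two wrappers might share code, in Python. This is an illustration under assumptions, not the SDR implementation: `get_property` and `with_property` stand for whatever the shared pip-installable module would export, `file_suffixes` is a placeholder tool, and the `files` and `file_types` keys are hypothetical rather than openDS schema properties.

```python
from copy import deepcopy
from pathlib import Path

# --- Helpers that would live in the shared, pip-installable module (hypothetical names) ---

def get_property(opends: dict, key: str, default=None):
    """Read one top-level property from an openDS document."""
    return opends.get(key, default)


def with_property(opends: dict, key: str, value) -> dict:
    """Return a copy of the openDS document with one property added; the input is untouched."""
    updated = deepcopy(opends)
    updated[key] = value
    return updated


# --- The tool itself: works on plain file paths and is not openDS-aware ---

def file_suffixes(paths) -> dict:
    """Placeholder tool: report the lower-cased suffix of each file name."""
    return {Path(p).name: Path(p).suffix.lower() for p in paths}


# --- Wrapper 1: handles the files natively (useful for testing and other workflows) ---

def native_wrapper(paths) -> dict:
    return file_suffixes(paths)


# --- Wrapper 2: takes openDS (JSON) plus a data directory, passed explicitly ---

def opends_wrapper(opends: dict, data_dir: str) -> dict:
    names = get_property(opends, "files", [])  # "files" is a hypothetical key
    paths = [Path(data_dir) / name for name in names]
    return with_property(opends, "file_types", file_suffixes(paths))
```

The tool function has a single implementation; only the thin wrappers differ in whether they read from and write back to an openDS document.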

llivermore commented 1 year ago

Also described on this slide: FDO Challenges

Also here: Livermore, Laurence; Brack, Paul; Scott, Ben; Soiland-Reyes, Stian; Woolland, Oliver (2022): The Specimen Data Refinery: Using a scientific workflow approach for information extraction. figshare. Presentation. https://doi.org/10.6084/m9.figshare.21312345.v1

Need to write up for final report.