emo-bon / MetaGOflow

MGnify oriented implementation for the Marine Genomic Observatories oriented pipeline, developed in the framework of an EOSC-Life funded project
https://metagoflow.readthedocs.io
Apache License 2.0

improve output files metadata in the ro-crate #39

Open hariszaf opened 1 year ago

hariszaf commented 1 year ago

Thanks to the edit-ro-crate.py script we now have descriptions for the RO-Crate that the wf returns; see #18.

However, the descriptions could be improved a great deal, and we could probably add more metadata there; working out what to add is, I guess, a bit of a research issue as well.

@cymon @isanti @kmexter I open this issue mostly as a poke for future versions of the wf. :tada:

kmexter commented 1 year ago

OK, we can certainly have a look at that, in the context of FAIR EASE for sure.

hariszaf commented 1 year ago

:+1: It will be really easy to add whatever is needed, I believe.

evangelospafilis commented 11 months ago

Hi all! This is a nice thread! @hariszaf, thank you for opening it. The following are examples of some simple additions to the output file metadata, file formats in particular. The main motivation for such additions is to further increase machine readability.

By looking up ontologies like EDAM (https://edamontology.org/page), the "fileFormat" of some of the metaGOflow output files could be described.

Example 1:

"@id": "results/ERR4765907_2.fastq.trimmed.fasta",
"@type": "File",
"encodingFormat": "text/plain",
"name": "Filtered .fastq file of the single-end reads (forward/reverse)."

could also include:

"fileFormat": {"@id": "http://edamontology.org/format_1929", "name": "FASTA"}

where http://edamontology.org/format_1929 corresponds to the FASTA sequence format.

Example 2:

"@id": "results/taxonomy-summary/SSU/ERR4765907.merged_SSU.fasta.mseq_json.biom",
"@type": "File",
"encodingFormat": "application/json-ld",
"name": "BIOM formatted taxon counts for SSU sequences"

could also include:

"fileFormat": {"@id": "http://edamontology.org/format_3746", "name": "BIOM format"}

where http://edamontology.org/format_3746 corresponds to the BIOM (BIological Observation Matrix) format.

Example 3: ditto for TSV files (http://edamontology.org/format_3475; tabular data represented as tab-separated values in a text file), and so on!
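A minimal sketch of how such annotations might be added programmatically. It assumes the crate metadata is a JSON-LD document with a top-level "@graph" list and that the format can be guessed from the file suffix. The function name `add_file_formats` and the suffix table are hypothetical and are not taken from edit-ro-crate.py; only the three EDAM IRIs come from the examples above.

```python
import json

# Suffix -> EDAM format, limited to the three formats mentioned in this thread.
# The suffix-based lookup is an assumption made for illustration.
EDAM_BY_SUFFIX = {
    ".fasta": {"@id": "http://edamontology.org/format_1929", "name": "FASTA"},
    ".biom": {"@id": "http://edamontology.org/format_3746", "name": "BIOM format"},
    ".tsv": {"@id": "http://edamontology.org/format_3475", "name": "TSV"},
}


def add_file_formats(crate: dict) -> int:
    """Add a "fileFormat" entry to every File entity with a known suffix.

    Returns the number of entities that were annotated.
    """
    annotated = 0
    for entity in crate.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "File" not in types:
            continue
        for suffix, fmt in EDAM_BY_SUFFIX.items():
            if entity.get("@id", "").endswith(suffix):
                entity["fileFormat"] = dict(fmt)  # copy, so entities do not share one dict
                annotated += 1
                break
    return annotated


if __name__ == "__main__":
    # Tiny hand-written crate fragment based on example 1 above.
    crate = {
        "@graph": [
            {
                "@id": "results/ERR4765907_2.fastq.trimmed.fasta",
                "@type": "File",
                "encodingFormat": "text/plain",
            }
        ]
    }
    add_file_formats(crate)
    print(json.dumps(crate, indent=2))
```

Files whose suffix is not in the table are left untouched, so the table can grow one format at a time.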

This could be really powerful, as subsequent analysis modules could check not only whether an input file is, e.g., text, but also whether it conforms to the appropriate file format.
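On the consuming side, such a check might look roughly like the sketch below. It assumes the "fileFormat" property proposed in this thread is present on the entity, either as an object with an "@id", a plain IRI string, or a list of these. The helper name `has_format` is hypothetical.

```python
EDAM_FASTA = "http://edamontology.org/format_1929"


def has_format(entity: dict, edam_iri: str) -> bool:
    """Return True if the entity declares the given EDAM format IRI."""
    fmt = entity.get("fileFormat")
    if fmt is None:
        return False
    candidates = fmt if isinstance(fmt, list) else [fmt]
    for candidate in candidates:
        # Accept both {"@id": "..."} objects and bare IRI strings.
        iri = candidate.get("@id") if isinstance(candidate, dict) else candidate
        if iri == edam_iri:
            return True
    return False


if __name__ == "__main__":
    entity = {
        "@id": "results/ERR4765907_2.fastq.trimmed.fasta",
        "@type": "File",
        "encodingFormat": "text/plain",
        "fileFormat": {"@id": EDAM_FASTA, "name": "FASTA"},
    }
    if not has_format(entity, EDAM_FASTA):
        raise SystemExit("input is not declared as FASTA")
    print("input is declared as FASTA")
```

This only checks what the metadata declares; it does not validate the file contents.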

kmexter commented 11 months ago

Sounds good. I have not looked at what is currently there, but to give you an idea of the minimum provenance information that we are adding to the ARMS ro-crate.json files, see my spreadsheet https://docs.google.com/spreadsheets/d/12Xc19hyD0NUoLezvUjMtZU6KBsyGvZ0dhPde59xEkmQ/edit?usp=sharing which I am using to tell Cedric what to add to the ro-crates that are automatically created for the 4 repos mentioned in the 4 tabs. You will see things like:

- geo coverage
- time coverage
- contributors
- influences
- description (I believe you have this field already)
- keywords

to which perhaps you could add dependsOn dependOn