FAIRMetrics / Metrics

This repository contains the results of the FAIR Metrics Group
http://fairmetrics.org
MIT License

General: metadata and data should be separately scored/made criteria #16

Closed JervenBolleman closed 4 years ago

JervenBolleman commented 6 years ago

I feel the current use of (meta)data is a bit confusing. I'm not sure how to resolve it, but one option is to duplicate each criterion that applies to both.

markwilkinson commented 6 years ago

I thought about this while I was writing the questionnaire, and I ended up concluding the opposite. The expectations of FAIR are (with very few exceptions, like the metadata preservation issue) identical for both metadata and data. Those exceptions are few enough that I am tempted to deal with them as exceptions, rather than duplicate everything else.

krobasky commented 6 years ago

Consider this scenario:

Two pipelines use two different aligners to create two BAM files (b1, b2) from the same FASTQ file (F1), and that FASTQ file is linked to the biosamples and assay protocols through a LIMS (Laboratory Information Management System). The files b1, b2 and F1 all contain every short read, so F1 can be reconstituted from either b1 or b2. To save on storage, F1 is aged out, but the provenance details for F1 are maintained in the LIMS and can thus be traced back from the BAMs.

In this scenario, the FAIR Meta-Data (FMDI) that wraps the b1 bam file connects to the assay information via the FASTQ. The FMDI for the FASTQ (in this case, implemented by a LIMS), must be persistent and indexed for searching, but the FASTQ need not be.
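The scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class, its fields, the `trace_provenance` helper, and the sample annotation are all hypothetical names, not part of any LIMS or FAIR Metrics API. It shows the metadata for F1 staying resolvable after the FASTQ file itself is gone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """Hypothetical metadata record (the "FMDI") wrapping one data file."""
    identifier: str
    derived_from: Optional[str] = None  # identifier of the parent record, if any
    data_available: bool = True         # False once the file has been aged out
    annotations: Dict[str, str] = field(default_factory=dict)


def trace_provenance(records: Dict[str, Record], start: str) -> List[Record]:
    """Follow derived_from links from `start` back to the root record."""
    chain = []
    current: Optional[str] = start
    while current is not None:
        record = records[current]
        chain.append(record)
        current = record.derived_from
    return chain


# F1 has been aged out, but its metadata record is kept (assumed sample name).
records = {
    "F1": Record("F1", data_available=False,
                 annotations={"biosample": "hypothetical-sample-1"}),
    "b1": Record("b1", derived_from="F1"),
    "b2": Record("b2", derived_from="F1"),
}

chain = trace_provenance(records, "b1")
for record in chain:
    print(record.identifier, "data available:", record.data_available)
```

In this sketch the assay information is still reachable from b1 through the F1 record, even though `data_available` is false for F1.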

This is a fairly common situation in sequencing, where the expectations of FAIR for the metadata aren't identical to those for the data. I think this covers more than just preservation, but I'd like to hear what others think.

markwilkinson commented 6 years ago

I'm not seeing the issue beyond preservation...? Please explain. thanks!

krobasky commented 6 years ago

It's possible that I'm defining the boundary between "data" and "metadata" differently... and I'm also reluctant to duplicate all the criteria, but metadata and the data it describes do seem to be different classes that deserve different FAIRness measures; perhaps rather than duplicating criteria, the solution is to only compute the FAIRness of the data in the context of how it is served (e.g., data + meta-data + data-server)

Maybe another example is helpful--

Consider a VCF file (that is, a file format for storing human genetic variants). If the VCF format is properly implemented, the file is Interoperable with tools. The format allows for information about provenance, which, if implemented, makes the file Reusable. However, the format says nothing about indexing, searching, how to serve the file, or even authorized access. The expectation would be on the metadata implemented by the portal or file server to meet those Findable and Accessible principles, not on the data file itself, right? In fact, the 'F' and 'A' principles don't make a lot of sense, I think, for many of the standalone files used by NIH researchers, including VCFs, GTFs, FASTQs, and BAMs.
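One way to picture assessing the data "in the context of how it is served" is to record which layer satisfies each principle. The sketch below is an assumption-laden illustration, not a proposed metric: the layer names, the `assess_bundle` function, and the assignment of principles to layers are hypothetical and simply mirror the VCF example, with R standing for FAIR's fourth principle.

```python
from typing import Dict, List, Set

PRINCIPLES = ("F", "A", "I", "R")


def assess_bundle(layers: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each FAIR principle to the names of the layers that satisfy it."""
    return {
        principle: [name for name, provided in layers.items()
                    if principle in provided]
        for principle in PRINCIPLES
    }


# Assumed split, following the example: the file format covers I and R,
# while the portal or file server metadata covers F and A.
bundle = {
    "vcf_file": {"I", "R"},
    "portal_metadata": {"F", "A"},
}

served = assess_bundle(bundle)
standalone = assess_bundle({"vcf_file": {"I", "R"}})

for principle in PRINCIPLES:
    print(principle, "served:", served[principle],
          "standalone:", standalone[principle])
```

Assessed as a bundle, every principle has at least one layer behind it; assessed alone, the file has nothing behind F or A.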

Thoughts?

markwilkinson commented 6 years ago

Consider a piece of pottery from an archaeological dig..........


markwilkinson commented 4 years ago

Closing a gen1 metric issue (I think it is dealt with by gen2 anyway).