FAIR-Data-EG / consultation

A call for contributions to the report of the FAIR Data Expert Group

FAIR data decisions: Lossy or lossless #27

Open hrzepa opened 7 years ago

hrzepa commented 7 years ago

One of the issues often confronted by depositors of aspiring FAIR data is how much data loss to tolerate. I give just one example: crystallographic data in chemistry, often described as the gold standard in chemical data. There is the following hierarchy of representations, in order of increasing data loss:

  1. The raw instrument data
  2. The processed instrument data, including "hkl" information
  3. The processed instrument data, including rich structure information but excluding "hkl" data
  4. The processed minimum dataset, which suffices for perhaps 90% of most users' needs
  5. A graphical representation of the minimum dataset, as a JPEG or PDF, which is itself lossy

So most consumers of, say, category 4 data would find it adequately FAIR for their needs, but some specialist users would find it too lossy and might need to go all the way back to category 1. The trouble is that this type of data might be as much as 10,000 times larger than the minimal set.

Unfortunately, there is no easy way of specifying the degree of data loss in an aspiring FAIR dataset as metadata. And this, remember, is considered the "gold" standard; one finds similar situations in other types of chemical data.
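By way of illustration, here is a minimal sketch, in Python, of what degree-of-loss metadata might look like if such a convention existed. The `LossLevel` scale and every field name below are hypothetical, not part of any current standard:

```python
from enum import IntEnum

class LossLevel(IntEnum):
    """Hypothetical degree-of-loss scale for the hierarchy above
    (1 = raw instrument data ... 5 = graphical representation only)."""
    RAW_INSTRUMENT = 1      # the raw instrument data
    PROCESSED_WITH_HKL = 2  # processed data, including "hkl" information
    PROCESSED_NO_HKL = 3    # rich structure information, "hkl" data excluded
    MINIMUM_DATASET = 4     # suffices for perhaps 90% of most users' needs
    GRAPHICAL_ONLY = 5      # a JPEG or PDF rendering, itself lossy

# A hypothetical deposit record: a consumer could inspect loss_level to
# decide whether this copy is FAIR enough for their purpose.
deposit_metadata = {
    "title": "Crystal structure of compound X",  # placeholder title
    "loss_level": LossLevel.MINIMUM_DATASET,
    "less_lossy_source": None,  # ideally a link back to a lower-numbered deposit
}

# A specialist who needs "hkl" data would reject anything past level 2:
if deposit_metadata["loss_level"] > LossLevel.PROCESSED_WITH_HKL:
    print("Too lossy for this purpose; request the raw or hkl-level data.")
```

Even a crude scale like this would let a consumer judge, from the metadata alone, whether a given deposit is FAIR enough for their purpose.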

evomellor commented 7 years ago

"Unfortunately there is no easy way of specifying the degree of data loss in any aspiring FAIR dataset as metadata information." Do you mean that once only the cleaned data are presented (e.g. category 4) it is impossible for another person to quantify the loss from category 1?

Though this would not preserve the lost information, metadata for a shared, cleaned dataset should ideally describe the cleaning process, up to and including any scripts that were used to do the cleaning. Besides scripts, a narrative description of the cleaning process and a reasonable explanation of what information has been lost are good practice.
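As a concrete sketch of the kind of provenance record meant here (all field names and values are hypothetical placeholders, not drawn from any existing metadata schema):

```python
# A minimal sketch of cleaning-process provenance for a shared, cleaned
# dataset. Field names and values are hypothetical placeholders.
cleaning_provenance = {
    "derived_from": "doi:10.xxxx/raw-dataset",  # placeholder id of the rawer data
    "cleaning_scripts": ["scripts/clean.py"],   # the actual scripts, archived with the data
    "narrative": "Outlier measurements were removed and duplicates merged; "
                 "see cleaning_scripts for the exact steps.",
    "information_lost": "The discarded raw measurements cannot be "
                        "reconstructed from the cleaned dataset.",
}
```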

I'm a strong proponent of trying not to let the perfect get in the way of the practical (or any improvement upon the status quo). For situations where sharing and preserving large data sets are impractical, sharing category 4 is a vast improvement.

Do you recommend revising the categories, or the standards defined for each?

band commented 7 years ago

NASA EOSDIS data products use a defined classification of Data Processing Levels. If such a classification is available for other data products, then maybe it is enough to include that level specification in the metadata.
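For example, here is a minimal sketch of such a level specification carried as a single metadata field. The level descriptions are paraphrased from the EOSDIS scheme (the 1A/1B sublevels are elided), and the field names are hypothetical:

```python
# A sketch of carrying a defined processing-level classification in dataset
# metadata. Descriptions paraphrased; field names hypothetical.
PROCESSING_LEVELS = {
    "L0": "raw instrument data at full resolution",
    "L1": "time-referenced, annotated (possibly calibrated) instrument data",
    "L2": "derived geophysical variables",
    "L3": "variables mapped onto uniform space-time grids",
    "L4": "model output or results derived from lower-level data",
}

product_metadata = {
    "product_id": "example-product",  # placeholder identifier
    "processing_level": "L2",         # the single extra field proposed above
}

level = product_metadata["processing_level"]
print(f"{product_metadata['product_id']}: {level} -- {PROCESSING_LEVELS[level]}")
```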

CaroleGoble commented 7 years ago

I would say the whole point is that there is no one FAIR. FAIR is a landscape of degrees, or levels: "50 shades of FAIR". This is highly related to the metrics. The worst thing we can do is declare a single perspective.
FAIR means different things to different stakeholders for different purposes, and that is to be celebrated and respected, not suppressed. What counts as "rich metadata" varies per domain, and varies:

- within and across disciplines
- across layers of the infrastructure stack: EOSC e-Infrastructures vs Research Infrastructures
- at the institutional level vs the public archives level
- depending on the purpose

CaroleGoble commented 7 years ago

The challenge will be distilling what is "in common" without enforcing one view or need.