emhart / 10-simple-rules-data-storage

A repository for the 10 simple rules data sharing paper to be submitted to PLoS Comp Biology
Creative Commons Zero v1.0 Universal

Keep the raw data raw #20

Closed naupaka closed 9 years ago

naupaka commented 9 years ago

The raw data should be kept raw and archived immediately. This is the data which shall not be overwritten or modified. This is the data others will want, and against which your choice of subsequent analyses will be judged.

PBarmby commented 9 years ago

I wouldn't argue with this as a general principle; there are, however, cases (e.g. radio astronomy) where the data rate is so large that keeping the raw data is not considered feasible.

dlebauer commented 9 years ago

> (e.g. radio astronomy) where the data rate is so large that keeping the raw data is not considered feasible.

More generally, the definition of 'raw' is not always clear: rarely is the 'raw' voltage differential (on which many sensors are based) recorded.

As a more concrete example of defining raw: an infrared gas analyzer measures absorption at a particular wavelength, but has been calibrated to report (e.g.) atmospheric concentration of CO2. Is the %absorption 'raw', or the %CO2? For a robust technology with good calibration and accuracy, it may seem obvious that the %CO2 is sufficiently 'raw', but this is less clear for less common methods that require operators to constantly run calibration curves.
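To make the distinction concrete, here is a minimal sketch of the archiving pattern this implies: store the instrument's absorption reading unchanged as the raw record, and treat the CO2 concentration as a derived product of the calibration. The coefficients and field names below are hypothetical, not from any real instrument.

```python
# Hypothetical linear calibration: ppm CO2 per unit fractional absorption.
# These coefficients are made up for illustration only.
CAL_SLOPE = 5200.0
CAL_OFFSET = 0.0

def co2_ppm(absorption_fraction):
    """Apply the (hypothetical) calibration to a raw absorption reading."""
    return CAL_OFFSET + CAL_SLOPE * absorption_fraction

# The raw record is archived exactly as measured and never modified.
raw_record = {"timestamp": "2015-06-01T12:00:00Z", "absorption": 0.075}

# The derived record is computed from the raw one; if the calibration is
# ever revised, it can be regenerated without touching the archive.
derived_record = dict(raw_record, co2_ppm=co2_ppm(raw_record["absorption"]))
```

The design choice this illustrates: whichever value you decide counts as 'raw', keep it immutable and make every downstream quantity reproducible from it plus a documented transformation.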

Not sure how to simplify this idea, but I believe it is an important concept, probably dealt with in the NEON documentation (@emhart?)

emhart commented 9 years ago

NEON handles this with a schema for various "levels" of data products that pertain to the amount of processing that happens, here's a brief overview: http://www.neoninc.org/science-design/data-processing

We defined raw data as things like voltage, or unprocessed lidar returns. The issue, of course, is that this is a tremendous amount of data that NEON still hasn't quite figured out how to share and document (e.g. if you ask for L0 they'll ship you an HDD). The way around it is to write detailed "Algorithm Theoretical Basis Documents" (ATBDs) describing the different processing "levels". This is mostly borrowed from the NASA EOSDIS program.
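A loose sketch of what such a levels scheme looks like in practice, paraphrased in the spirit of the NEON/EOSDIS approach (the level descriptions below are illustrative, not NEON's official definitions):

```python
# Illustrative processing levels, loosely modeled on the NEON / NASA EOSDIS
# convention. Each level is defined by the processing applied to the one
# below it, which is what the ATBDs are meant to document.
LEVELS = {
    "L0": "raw instrument output (e.g. voltages, unprocessed lidar returns)",
    "L1": "calibrated measurements in physical units",
    "L2": "derived quantities at the original sampling resolution",
    "L3": "derived quantities aggregated onto uniform space-time grids",
}

def describe(level):
    """Return a one-line description of a processing level."""
    return f"{level}: {LEVELS[level]}"
```

The point of the scheme is less the labels themselves than the discipline: each level's transformation is documented, so anything above L0 can be regenerated.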

I really like this idea, but I think we need some technical caveats, such as keeping data as close to raw as possible given current technical limitations. What we really want is to get people to think about the spirit of what @naupaka is saying.