Closed khinsen closed 8 years ago
Very good points. One of the early incarnations of this rule (#12) was "store data in common open format", but "easy to process by software / widely supported" are also important features.
Common and widely supported is not the same as easy to process. An example is the PDB format for macromolecular structures, which is common and widely supported, but such a mess that processing it correctly is a huge effort. In fact, almost no software processes PDB files correctly according to the specification.
In such situations, my personal advice is to go with a simple and clear format, even if it is not the most popular one. Otherwise the community remains stuck with a bad format forever. I know many people do not agree with this point of view.
I see the point that @khinsen is making here, but the term machine-readable is widely used and is not equivalent to any file stored on a computer. In my opinion, the term should be retained in the manuscript but we should improve on the definition/meaning.
I think @PBarmby has sufficiently addressed this issue, closing.
The term "machine-readable" in rule 5 isn't very clear in my opinion. Any computer file is machine readable by definition. The opposite of machine-readable would thus be data stored on printed paper, but that's not the message of rule 5.
In technical terms, the topic of rule 5 is how easy a data format is to parse. The extreme end of the spectrum is data that is impossible to parse because there is no formal data format at all.
Given the level of this paper, I understand that the term "parsing" is best avoided, but that makes it difficult to be precise. One possibility is to be vague in the title ("Data should be easy to process by software") and give both "good" and "bad" examples in the text. The archetype of the bad example is data embedded in prose stored in a Word or PDF file (yes, I have seen that).
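To make the "good" versus "bad" contrast concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration (the sample names, the temperature values, the column names, and the helper functions `read_csv` and `read_prose`); none of it comes from the manuscript. It reads the same two measurements once from a small CSV table and once from a sentence of prose.

```python
import csv
import io
import re

# Hypothetical "good" example: two measurements stored as a small CSV table.
CSV_TEXT = "sample,temperature_K\nA,293.5\nB,301.2\n"

# Hypothetical "bad" example: the same two values embedded in prose.
PROSE_TEXT = (
    "Sample A was measured at 293.5 K, whereas for sample B "
    "we found a temperature of 301.2 K."
)


def read_csv(text):
    """Read the table with the standard csv module; no guessing is needed."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["sample"]: float(row["temperature_K"]) for row in reader}


def read_prose(text):
    """Recover the values with a regular expression tied to the exact wording."""
    # Matches "sample <name>", skips any non-digit characters, then "<number> K".
    pattern = re.compile(r"[Ss]ample (\w+)\D*?(\d+\.\d+) K")
    return {name: float(value) for name, value in pattern.findall(text)}


if __name__ == "__main__":
    print(read_csv(CSV_TEXT))
    print(read_prose(PROSE_TEXT))
    # A harmless rewording defeats the prose reader: it finds nothing here.
    print(read_prose("The temperature was 293.5 K for sample A."))
```

The CSV reader relies only on the format's rules, so it keeps working when rows are added or reordered. The prose reader works only as long as the author keeps the same sentence structure, which is the practical meaning of "hard to process by software".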