Closed — fmichonneau closed this issue 9 years ago
This captures the main idea well--it should probably be noted that tidy data in Hadley's sense is most applicable to delimited text data (TSV/CSV), and that the structure of observations might be very different with HDF5 or netCDF, for example, although both of the latter could also be considered machine readable as long as their data structures are set up in some standard fashion. And I just thought of this: do we talk anywhere about storing data in the short or long term in a relational database (and the pros/cons)?
Perhaps focus on 'normalized' rather than 'tidy' data. Is tidy a special case of normalized? Does the definition always hold that tidy = less repetition? Is tidy a recent @hadley term, or a general, robust concept like normalization?
I think @fmichonneau means it in the sense of Hadley's paper (which he cites)
I suppose in this case 'tidy' is a particular subset of 'normalized'.
This might be overly pedantic, but I'm not sure CSV qualifies as a format that has clear specifications. There is RFC 4180, but in practice there are many formats that get called CSV but aren't compatible. This isn't an argument to remove CSV/TSV from the list of suggested file formats, but should we call attention to the lack of a common standard and/or suggest that people create CSVs that adhere to RFC 4180?
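To make the RFC 4180 point concrete, here's a minimal sketch of writing a CSV that stays close to the RFC using Python's standard-library csv module (CRLF line endings, embedded double quotes doubled). The field names and values are invented for illustration:

```python
import csv
import io

# Invented example rows; one value deliberately contains a double quote.
rows = [
    {"site": "SiteA", "tree": "Tree1", "dbh_cm": 4.5},
    {"site": "SiteA", "tree": 'Tree "2"', "dbh_cm": 3.9},
]

buf = io.StringIO()
# RFC 4180 specifies CRLF record separators; embedded double quotes are
# escaped by doubling, which csv does automatically when quoting a field.
writer = csv.DictWriter(buf, fieldnames=["site", "tree", "dbh_cm"],
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

This won't make readers of sloppy CSVs interoperable, but it at least ensures the files we produce have a defensible, documented interpretation.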
Tidy data is (pretty much exactly) Codd's 3rd normal form, but framed in a way that most people can actually understand.
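As a concrete illustration of what "tidy" means in practice (one observation per row, one variable per column), here's a minimal plain-Python sketch reshaping a "wide" table into tidy long form; the site names, years, and counts are all invented:

```python
# "Wide" (untidy) layout: one column per year, so "year" is a variable
# smeared across column headers rather than stored in its own column.
wide = [
    {"site": "SiteA", "2013": 12, "2014": 15},
    {"site": "SiteB", "2013": 8,  "2014": 11},
]

# Tidy (long) layout: each row is one observation with explicit
# site, year, and count variables.
tidy = [
    {"site": row["site"], "year": int(year), "count": value}
    for row in wide
    for year, value in row.items()
    if year != "site"
]
print(tidy)
```

The long form repeats the site names, which is why "tidy = less repetition" isn't quite right; the win is that every variable is addressable by name, as in 3NF.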
@karawoo This paper could be a good opportunity to promote RFC 4180 (one of my personal favourites). In Rule 4 probably.
Thanks all for your comments!
:+1:
> Tidy data is (pretty much exactly) Codd's 3rd normal form, but framed in a way that most people can actually understand.
Thanks @hadley for the clarification. I think it will be useful to provide this context.
@fmichonneau Nice work on this, I think it's coming together well. Just a couple of thoughts on some of the other formats. NetCDF and HDF5 formats don't necessarily take on the form of "tidy" data you refer to. My reading is that tidy data is a way of arranging data well within a single flat file. However, NetCDF is conventionally a gridded format where each cell can be thought of as a z value in a 2-D x-y grid. HDF5 files can take this form too, but they can also be used to store data in a nested format analogous to a directory structure (though not necessarily). Here's an ecological example for clarification. Let's say I'm measuring DBH at different sites in different regions. I might store that data in HDF5 like this:

```
Region1/
|- SiteA/
|- Tree1, 4.5
|- Tree2, 3.9
|- Tree3, 2.6
```
If I want to analyze this data in a nice tidy flat file, it would be something like:

```
Region1  SiteA  Tree1  4.5
Region1  SiteA  Tree2  3.9
Region1  SiteA  Tree3  2.6
```
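The hierarchical-to-flat conversion above is mechanical; a minimal sketch in plain Python (the region/site/tree names and DBH values just mirror the example):

```python
# Nested, HDF5-like structure: region -> site -> tree -> DBH.
nested = {
    "Region1": {
        "SiteA": {"Tree1": 4.5, "Tree2": 3.9, "Tree3": 2.6},
    },
}

# Flatten into tidy rows: one (region, site, tree, dbh) tuple per
# observation, with the grouping keys repeated explicitly.
tidy = [
    (region, site, tree, dbh)
    for region, sites in nested.items()
    for site, trees in sites.items()
    for tree, dbh in trees.items()
]
for row in tidy:
    print(row)
```

Going the other way (flat to nested) is just a group-by on the leading columns, which is one reason the two representations are interchangeable for data this small.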
I think there's a couple of other things to consider in this section:
1). Are all open formats by definition machine readable? E.g., are all ASCII formats machine readable (simply because they are ASCII), or only the ones that adhere to a standard?
2). What about databases? Is MySQL a machine readable format? Or would we only want to recommend databases that can be exported (I'm thinking about how CouchDB just stores data as JSON)? Is HDFS a machine readable format? It's an open source standard.
3). A slightly clearer definition of what "machine readable" means exactly. Is it using an open format with a standard? Having embedded metadata? I wish I had more insight; I'm still trying to wrap my head around it.
Perhaps a key point is "for smaller datasets, use flat files because they are easy and accessible". I propose a definition and methods for handling bigger data issues in rule 9 (#39)
Are databases or NetCDF useful for small datasets? I am not sure, except where the small dataset is a subset of a larger one that is already stored in NetCDF or a database.
To clarify a few points made by @emhart, though:
In this example,

```
Region1/
|- SiteA/
|- Tree1, 4.5
|- Tree2, 3.9
|- Tree3, 2.6
```
this is the same representation (right?) as can be done in HDF5 or CSV as shown above, or in a three-table relational database (tables `regions`, `sites`, and `trees`, with each region having >= 0 sites, and each site having >= 0 trees).
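A minimal sketch of that three-table layout, using the standard-library sqlite3 module; the table names, column names, and IDs are invented to match the running example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One table per entity; foreign keys express "each region has >= 0 sites,
# each site has >= 0 trees".
con.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sites   (site_id INTEGER PRIMARY KEY,
                          region_id INTEGER REFERENCES regions, name TEXT);
    CREATE TABLE trees   (tree_id INTEGER PRIMARY KEY,
                          site_id INTEGER REFERENCES sites,
                          name TEXT, dbh REAL);
""")
con.execute("INSERT INTO regions VALUES (1, 'Region1')")
con.execute("INSERT INTO sites VALUES (1, 1, 'SiteA')")
con.executemany("INSERT INTO trees VALUES (?, ?, ?, ?)",
                [(1, 1, "Tree1", 4.5),
                 (2, 1, "Tree2", 3.9),
                 (3, 1, "Tree3", 2.6)])

# Joining the three tables back together recovers the tidy flat file.
rows = con.execute("""
    SELECT r.name, s.name, t.name, t.dbh
    FROM trees t
    JOIN sites s USING (site_id)
    JOIN regions r USING (region_id)
    ORDER BY t.tree_id
""").fetchall()
print(rows)
```

The join output is exactly the flat-file representation, which supports the point that for in-memory datasets the relational machinery buys little over a tidy table.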
An alternative representation might be spatially explicit: a recursive PostGIS table in which the geometry column can represent regions and sites as vector polygons and trees as points, supporting queries to locate all trees within some site. Not sure how much of this is useful.
Again, I suggest that for data that fits in memory, such models are less essential than having tidy, i.e. easily usable, data.
I know that "ASCII" is useful shorthand, but it is also jargon, and strictly it refers to an archaic 7-bit code table for unaccented Roman characters. Not all science is done in English. Better to recommend Unicode in general, and UTF-8 in particular. UTF-8 is of course backward compatible with ASCII, or rather, ASCII-encoded data is also UTF-8 data.
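A small sketch of the UTF-8 recommendation in practice: write delimited data with an explicit encoding so non-ASCII site names survive a round trip (the file name and Swedish site name are invented):

```python
import csv
import os
import tempfile

# Invented rows; the first site name contains non-ASCII characters.
rows = [["site", "dbh_cm"],
        ["Försöksyta 1", "4.5"],
        ["SiteA", "3.9"]]

path = os.path.join(tempfile.mkdtemp(), "trees.csv")
# Always name the encoding explicitly rather than relying on the
# platform default; newline="" is what the csv module expects.
with open(path, "w", encoding="utf-8", newline="") as fh:
    csv.writer(fh).writerows(rows)

# Reading back with the same declared encoding recovers the text intact;
# the ASCII-only rows are byte-identical to a plain ASCII file.
with open(path, encoding="utf-8") as fh:
    print(fh.read())
```

The backward-compatibility point falls out of this: delete the accented row and the resulting UTF-8 file is a valid ASCII file too.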
Here is an initial draft for Rule 10.
There might be material in it that belongs to other rules. Let me know if you think that's the case.
@naupaka and @emhart let me know if I captured all the concepts you were hoping this rule would cover.
Note that I didn't commit the manuscript here, as the compilation fails on my system (probably an issue with my version of pandoc) and I haven't had a chance to investigate.
Cheers!