emhart / 10-simple-rules-data-storage

A repository for the 10 simple rules data sharing paper to be submitted to PLoS Comp Biology
Creative Commons Zero v1.0 Universal

Draft for Rule 10 #48

Closed fmichonneau closed 9 years ago

fmichonneau commented 9 years ago

Here is an initial draft for Rule 10.

There might be material in it that belongs to other rules. Let me know if you think that's the case.

@naupaka and @emhart let me know if I captured all the concepts you were hoping this rule would cover.

Note that I didn't commit the manuscript here as the compilation fails on my system (probably an issue with the version of pandoc) and I haven't had a chance to investigate.

Cheers!

naupaka commented 9 years ago

This captures the main idea well--I think it should probably (?) be noted that tidy data in Hadley's sense is most applicable to delimited text data (tsv/csv), and that the structure of observations might be very different with hdf5 or netCDF, for example, although both of the latter could also be considered machine readable as long as their data structures are set up properly in some standard fashion. And I just thought of this, but do we talk anywhere about storing data in the short or long term in a relational database (and the pros/cons)?

dlebauer commented 9 years ago

Perhaps focus on 'normalized' rather than 'tidy' data. Is tidy a special case of normalized? Does the definition always hold that tidy = less repetition? Is tidy a recent @hadley term, or a general, robust concept like normalization?

naupaka commented 9 years ago

I think @fmichonneau means it in the sense of Hadley's paper (which he cites)

I suppose in this case 'tidy' is a particular subset of 'normalized'.

karawoo commented 9 years ago

This might be overly pedantic, but I'm not sure CSV qualifies as a format that has clear specifications. There is RFC 4180, but in practice there are many formats that get called CSV but aren't compatible. This isn't an argument to remove CSV/TSV from the list of suggested file formats, but should we call attention to the lack of a common standard and/or suggest that people create CSVs that adhere to RFC 4180?
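As a small illustration of what RFC 4180 compliance looks like in practice (hypothetical data), Python's standard `csv` module produces output close to the RFC by default: CRLF record delimiters, quoting of fields that contain commas or quotes, and embedded double quotes escaped by doubling them.

```python
import csv
import io

# Hypothetical rows; the third field contains a comma and double quotes,
# so RFC 4180 requires it to be quoted, with inner quotes doubled.
rows = [
    ["site", "species", "note"],
    ["A1", "Quercus alba", 'tagged as "old growth", revisit'],
]

buf = io.StringIO()
# csv.writer uses "\r\n" line endings by default, matching RFC 4180.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

print(buf.getvalue())
# The note field comes out as: "tagged as ""old growth"", revisit"
```

Many CSV dialects in the wild differ from exactly this behavior (other quoting rules, other delimiters), which is the ambiguity being pointed out.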

hadley commented 9 years ago

Tidy data is (pretty much exactly) Codd's 3rd normal form, but framed in a way that most people can actually understand.
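For a concrete (hypothetical) illustration of the tidy/3NF idea: a "wide" table with one column per treatment repeats the variable name in the header, while the tidy form has one row per observation, with the treatment moved into its own column.

```python
# Hypothetical "wide" layout: one row per subject, one column per treatment.
wide = [
    {"subject": "s1", "treatmentA": 4.5, "treatmentB": 3.2},
    {"subject": "s2", "treatmentA": 5.1, "treatmentB": 2.8},
]

# Tidy (roughly 3NF) layout: one row per observation; the treatment name
# becomes a value in its own variable column.
tidy = [
    {"subject": row["subject"], "treatment": key, "value": row[key]}
    for row in wide
    for key in ("treatmentA", "treatmentB")
]

for obs in tidy:
    print(obs["subject"], obs["treatment"], obs["value"])
```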

drj11 commented 9 years ago

@karawoo This paper could be a good opportunity to promote RFC 4180 (one of my personal favourites). In Rule 4 probably.

fmichonneau commented 9 years ago

Thanks all for your comments!

  1. I am not familiar enough with hdf5 and netCDF to know if it makes sense to use the tidy concept for data stored in this kind of format.
  2. I don't think we talk about the pros/cons of data formats for the different stages of the analytical process. Maybe it could be covered by Rule 2 (#32)? I am going to create a new issue to make sure it doesn't get forgotten.
  3. I'll add a mention about normalized data in the next revision of the text.
  4. @karawoo it might be pedantic, but I agree. CSV files can take many shapes and forms. Let's talk about RFC 4180 in Rule 4 (#34). My point here was to use data formats that can be imported into a variety of computing environments without having to write a parser for them.

karawoo commented 9 years ago

:+1:

dlebauer commented 9 years ago

> Tidy data is (pretty much exactly) Codd's 3rd normal form, but framed in a way that most people can actually understand.

Thanks @hadley for the clarification. I think it will be useful to provide this context.

emhart commented 9 years ago

@fmichonneau Nice work on this, I think it's coming together well. Just a couple of thoughts on some of the other formats. NetCDF and HDF5 formats don't necessarily take on the form of "tidy" data you refer to. My reading is that tidy data is a way of having well-arranged data in a single flat file. However, NetCDF is conventionally a gridded format where each cell can be thought of as a z value in a 2-D x-y grid. HDF5 files can take this form too, but they can also be used to store data in a nested format analogous to a directory structure (though not necessarily). Here's an ecological example for clarification. Let's say I'm measuring DBH at different sites in different regions. I might store that data in HDF5 like this:

```
Region1/
|- SiteA/
   |- Tree1, 4.5
   |- Tree2, 3.9
   |- Tree3, 2.6
```

If I want to analyze this data in a nice tidy flat file it would be something like:

```
Region1  SiteA  Tree1  4.5
Region1  SiteA  Tree2  3.9
Region1  SiteA  Tree3  2.6
```
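The conversion from the nested layout to the tidy flat file can be sketched like this (using a plain nested dict standing in for the HDF5 group hierarchy; with a real file, `h5py` or similar would expose the groups analogously):

```python
# Hypothetical nested structure mirroring the HDF5 hierarchy above:
# region -> site -> tree -> DBH.
data = {
    "Region1": {
        "SiteA": {"Tree1": 4.5, "Tree2": 3.9, "Tree3": 2.6},
    },
}

# Flatten into tidy rows: one observation (one tree's DBH) per row.
rows = [
    (region, site, tree, dbh)
    for region, sites in data.items()
    for site, trees in sites.items()
    for tree, dbh in trees.items()
]

for r in rows:
    print("\t".join(map(str, r)))
```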

I think there's a couple of other things to consider in this section:

1). Are all open formats by definition machine readable? E.g. are all ASCII formats machine readable (because they are ASCII), or only ones that adhere to a standard?

2). What about databases? Is MySQL a machine readable format? Or would we only want to recommend databases that can be exported (I'm thinking about how CouchDB just stores data in JSON)? Is HDFS a machine readable format? It's an open source standard.

3). A slightly clearer definition of what "machine readable" means exactly. Is it using an open format with a standard? Has embedded metadata? I wish I had more insight, I'm still trying to wrap my head around it.

dlebauer commented 9 years ago

Perhaps a key point is "for smaller datasets, use flat files because they are easy and accessible". I propose a definition and methods for handling bigger data issues in rule 9 (#39).

Are databases or NetCDF useful for small datasets? I am not sure, except where the small dataset is a subset of a larger one that is already stored in NetCDF or a database.

To clarify a few points made by @emhart, though:

In this example,

```
Region1/
|- SiteA/
   |- Tree1, 4.5
   |- Tree2, 3.9
   |- Tree3, 2.6
```

This is the same representation (right?) as can be done in HDF5 or CSV as shown above, or in a three-table relational database (tables regions, sites, and trees, with each region having >= 0 sites and each site having >= 0 trees).
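A minimal sketch of that three-table relational layout, using SQLite in memory (table and column names are illustrative, not from the draft):

```python
import sqlite3

# Hypothetical schema: regions, sites, trees, linked by foreign keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sites   (id INTEGER PRIMARY KEY,
                      region_id INTEGER REFERENCES regions(id), name TEXT);
CREATE TABLE trees   (id INTEGER PRIMARY KEY,
                      site_id INTEGER REFERENCES sites(id), name TEXT, dbh REAL);
INSERT INTO regions VALUES (1, 'Region1');
INSERT INTO sites   VALUES (1, 1, 'SiteA');
INSERT INTO trees   VALUES (1, 1, 'Tree1', 4.5),
                           (2, 1, 'Tree2', 3.9),
                           (3, 1, 'Tree3', 2.6);
""")

# Joining the three tables back together reproduces the tidy flat file.
rows = con.execute("""
    SELECT r.name, s.name, t.name, t.dbh
    FROM trees t
    JOIN sites s   ON t.site_id = s.id
    JOIN regions r ON s.region_id = r.id
    ORDER BY t.name
""").fetchall()

for row in rows:
    print(row)
```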

An alternative representation might be spatially explicit: a recursive PostGIS table in which the geometry column can represent regions and sites as vector polygons and trees as points, supporting queries to locate all trees within some site. Not sure how much of this is useful.

Again, I suggest that for data that fits in memory, such models are less essential than having tidy, i.e. easily usable, data.

timchurches commented 9 years ago

I know that "ASCII" is useful shorthand, but it is also jargon, and strictly it refers to an archaic 7-bit code table for unaccented Roman characters. Not all science is done in English. Better to recommend Unicode in general, and UTF-8 in particular. UTF-8 is of course backward compatible with ASCII, or rather, ASCII-encoded data is also UTF-8 data.
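The compatibility claim is easy to demonstrate (hypothetical strings, Python's standard codecs): code points 0-127 have the same single-byte encoding in ASCII and UTF-8, while characters outside that range take 2-4 bytes in UTF-8.

```python
# ASCII-encoded bytes are valid UTF-8: the two encodings agree on
# code points 0-127.
ascii_bytes = "plain text".encode("ascii")
assert ascii_bytes.decode("utf-8") == "plain text"

# Beyond ASCII, UTF-8 uses 2-4 bytes per code point, so accented and
# non-Roman characters round-trip without a legacy code page.
text = "Müller, 北京"
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text
print(len(text), len(encoded))  # 10 characters, 15 bytes
```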