datacarpentry / spreadsheet-ecology-lesson

Data Organization in Spreadsheets for Ecologists
https://datacarpentry.org/spreadsheet-ecology-lesson
Other
37 stars 141 forks source link

Link to or incorporate list of CSV gotchas/mistaken assumptions #230

Closed ctb closed 3 years ago

ctb commented 6 years ago

From https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs by @donatj --

Everything on this list is a false assumption that developers make.

All CSVs are ASCII
All CSVs are Win1252
All CSVs are in 8-bit encodings
All CSVs are UTF-8
All CSVs are UTF-16
All CSVs contains a single consistent encoding
All records contain a single consistent encoding
All fields contain a single consistent encoding
All CSVs contain records
All records contain fields
Fields never contain record separators
Fields never contain delimiters
Fields never contain control characters
Delimiters are escaped with a \
All fields are enclosed by double quotes
All records are a single line
All lines contain a single record
All records contain the same number of fields
All records contain the same number of fields as the header
All records contain the same number of fields or fewer than the header
All CSVs contain a header
All record separators are CRLF
All record separators are LF
All record separators are a single byte
All record separators are a single rune
All newlines are a single byte
All CSVs are delimited with a comma
All CSVs are delimited with a comma, tab or semicolon
TSV isn't CSV
All delimiters are a single byte
All commas are a single byte
All CSVs are readable with Excel
Excel is a good tool for working with CSVs
Excel is an OK tool for working with CSVs
Excel can losslessly save CSVs it opens
Using ="{value}" is a good way to get around Excel auto-formatting
The first line will never be a poorly supported instruction header
Using sep={char} is a good way to get Excel to accept your delimiter
Prepending a BOM is a good way to get Excel to read your encoding
You can safely name your first column "ID"
All CSVs follow RFC4180
Most CSVs follow RFC4180
All CSVs follow the same defined standard
All CSVs follow a defined standard
All CSVs have a .csv extension
All CSV is human readable
Please take these into consideration next time you find yourself working with CSV. If you can think of anything I may have missed I'd be happy to add it.
ctb commented 6 years ago

pushback: https://news.ycombinator.com/item?id=16810193

hoytpr commented 6 years ago

Great information @ctb. @ErinBecker @cbahlai @ethanwhite @tracykteal Should we ask @ctb to work this into a PR?

:-) More seriously, We can put a link to the info in the lesson, but as the pushback indicates, their are difficulties with defining valid .csv for particular applications. Coming somewhat from the client side, yet also running a service core that deals with .csv (and Excel) extensively, this is important and complex. Can we acknowledge these issues and then define our currrent standard for "Ecology" that will also be compliant with other array-like data sets in the Carpentries? (This assumes Excel can open what developers consider a properly formatted .csv file) The end-goal is clear reproducibility. I'd be very interested to know what we could agree on.

ctb commented 6 years ago

:)

I think it is useful to have a link to this somewhere, basically. It will if nothing else provide further context for some statements and decisions.

I'm happy to work it into a PR!