Tabular data - Githubissues

davidread commented 6 years ago

Category: Data

Challenge Owner

David Read, tech arch at MoJ's Analytical Platform. Background with GDS on: data.gov.uk, Better Use of Data team.

Short Description

Tabular data (e.g. CSV) is the most common data format but it is loosely defined and users would benefit from standardizing on the details.

This challenge is not about Excel workbooks or similar. It is about data that is primarily consumed by machines/software, rather than humans.

This challenge is not about metadata (e.g. schema / column types, licence) or validation. That is covered in challenge #40. The options there, including CSV on the Web and Tabular Data Package, are both about putting metadata in a separate file, so that is a separate conversation.

User Need

Off the top of my head:

As a data scientist I want to open the file directly in Python or R into a tabular data structure (e.g. DataFrame) without having to wrangle it (frictionless access to data) so that I can efficiently analyse it
As a developer I want to do simple processing of the data in a range of simple software (e.g. command-line bash, Javascript, Java)
As a citizen with low data literacy I want to open the file directly into Excel to view the table so that I can browse government data (secondary use case)

Expected Benefits

We want to encourage government users and citizens to use government data more, for greater understanding and decision-making. There are plenty of barriers to this, including skills, tools, access and licensing, but one small yet significant barrier is the proliferation of different ways of using CSV. These often require users to do extra work:

configure 'dialect', such as quote character, escape character, line ending
collapse multiple header rows into one, or create a missing one
character encoding conversion (happily this is covered by an existing government open standard)
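
To illustrate the 'dialect' friction in the list above, here is a minimal Python sketch. The sample data and its semicolon/single-quote dialect are made up for illustration; the point is only that a non-standard file forces the reader to supply settings that an RFC 4180 file does not need.

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, single-quoted, doubled-quote escaping.
raw = "name;note\r\n'Smith; J';'said ''hi'''\r\n"

# The consumer has to know (or guess) every one of these settings.
reader = csv.reader(
    io.StringIO(raw, newline=""),
    delimiter=";",
    quotechar="'",
    doublequote=True,
)
rows = list(reader)

# The same content written as RFC 4180 CSV parses with the defaults.
standard = 'name,note\r\n"Smith; J","said \'hi\'"\r\n'
rows_standard = list(csv.reader(io.StringIO(standard, newline="")))
```

Both calls produce the same two rows, but only the second works without any prior knowledge of the file.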

Examples of bad tabular data:

ONS generally put metadata in the CSV e.g. https://www.ons.gov.uk/generator?format=csv&uri=/economy/grossdomesticproductgdp/timeseries/l8gg/qna
Defra example where lots of tables are put into a single CSV e.g. https://data.gov.uk/dataset/f0f1a7b9-4a56-48c9-b0b3-99482e7d6980/basic-horticultural-statistics
Cabinet Office example where you'd have to combine 2 rows to get the headers https://data.gov.uk/dataset/14992eea-4f20-479d-8314-d5b08f1a9b9f/cabinet-office-annual-report-and-accounts-2012-13/datafile/803467f0-b2cc-4bde-9fc7-8b14646f8774/preview
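
As a rough idea of the wrangling the two-row header case needs, here is a Python sketch. The function name `collapse_header_rows` and the sample data are hypothetical and are not taken from the Cabinet Office file; real files often also need blank merged cells filled in first, which this sketch does not attempt.

```python
import csv
import io


def collapse_header_rows(rows, n_header_rows=2, sep=" "):
    """Merge the first n_header_rows rows into a single header row."""
    header_rows = rows[:n_header_rows]
    width = max(len(r) for r in header_rows)
    merged = []
    for col in range(width):
        parts = []
        for row in header_rows:
            # Tolerate ragged header rows by treating missing cells as empty.
            cell = row[col].strip() if col < len(row) else ""
            if cell:
                parts.append(cell)
        merged.append(sep.join(parts))
    return [merged] + rows[n_header_rows:]


# Hypothetical file with the header split over two rows.
raw = "Spend,Spend,Staff\r\n2012,2013,2013\r\n10,12,3\r\n"
rows = list(csv.reader(io.StringIO(raw, newline="")))
tidy = collapse_header_rows(rows)
```

With a single header row in the published file, none of this is needed.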

Functional Needs

The functional needs that the proposal must address.

frankieroberto commented 5 years ago

@davidread good summary. However I don’t think it’s just Excel which can use the BOM. Most web browsers can too – and I think they're an important use-case, as lots of people like to preview a CSV file before downloading it (to check what’s in it, or that they’ve found the right one). You don’t need a BOM to make CSVs render in browsers, but it can be easier than making sure the files are served using the UTF-8 character set in the MIME type.

Does anyone have R Studio or SPSS that can check whether they cope with the BOM or not?

And can you give any examples of UNIX tools not liking the BOM? I just tried grep on macOS, and it worked fine. Ruby seems to handle it too.

Based on this, I still think we should recommend a BOM, if possible (as well as the text/csv; charset=utf-8 content-type http header) – but acknowledge that this isn’t always possible, that there are pros and cons, and that it’s not part of the standard but is supplemental guidance.
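
For reference, setting that header is a small amount of code on the server side. The following is only a sketch using Python's standard-library WSGI support; the body, the function names `app` and `serve`, and the port are illustrative assumptions, not how any of the sites mentioned here are actually served.

```python
from wsgiref.simple_server import make_server

# Illustrative CSV body, encoded as UTF-8 with no BOM.
CSV_BODY = "item,price\r\ntea,£2\r\n".encode("utf-8")


def app(environ, start_response):
    """Serve one CSV, declaring its encoding in the Content-Type header."""
    headers = [
        ("Content-Type", "text/csv; charset=utf-8"),
        ("Content-Length", str(len(CSV_BODY))),
    ]
    start_response("200 OK", headers)
    return [CSV_BODY]


def serve(port=8000):
    """Run a local test server (the port number is arbitrary)."""
    with make_server("", port, app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts a local server for manual testing. The header only helps while the file is being fetched over HTTP; once the file is saved to disk the declared encoding is lost, which is where the BOM question comes back in.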

arnau commented 5 years ago

@frankieroberto the one that I mentioned before is cat. When you concatenate two files with BOM with a tool that does not deal with encoding (e.g. cat) the result is a file with a BOM in the middle of the file, which is wrong. I'll try to find some time to test if this behaviour can be generalised to other tools/languages for this operation specifically.

The one I use, xsv handles it correctly although it drops the BOM on the resulting file which makes the result not ready for publication (if we require BOM that is).

frankieroberto commented 5 years ago

@arnau ah, that's interesting. Does cat do the same for UTF16 files?

The Wikipedia page on the BOM mentions "many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics".

I don’t think anyone is suggesting we require (or forbid) the BOM? This is just about guidance.

Ultimately, the decision is down to service teams, who may need to decide between prioritising Excel/web browser/desktop application users vs automated/unix pipeline users?

davidread commented 5 years ago

Yes I think you're right @frankieroberto, the outcome of the BOM discussion is just a recommendation accompanying the choice of standard, so holds less weight, but this discussion may help inform service teams.

I just did some playing with Excel 2016 for Mac (16.16.9):

opening a CSV without BOM - opens fine - just mangles the £. This is fixed if you use 'Text Import Wizard'.
opening a CSV with BOM - takes me to the import wizard where it detects UTF8 fine (I think it went to the wizard because it was such a small file to use heuristics to detect the column separator, so not a worry)
"save as CSV" - in the drop-down of "Common formats" it offers "CSV UTF-8 (Comma delimited)" and it gets a BOM. Tucked away in "Specialty formats" are also "Comma separated values", "Macintosh Comma Separated" and "MS-DOS Comma Separated", which all have no BOM but encode the "£" as a single byte - i.e. ASCII8 encoded
round-trip with BOM (open a CSV with BOM and save it again) - works fine - it comes out with a BOM
round-trip without BOM (open a CSV without BOM and save it again) - had to use 'Text Import Wizard' to open, and can't find a way to save as UTF8 without BOM.
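
To make the "£" observations concrete, here is a short Python sketch of the byte-level difference. It assumes Windows-1252 (`cp1252`) as a stand-in for the single-byte encoding; which code page Excel actually uses for each of those save options was not checked here.

```python
pound = "£"

utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
cp1252_bytes = pound.encode("cp1252")   # one byte: 0xA3 (assumed code page)

# UTF-8 bytes read as if they were cp1252 give the familiar mangled pound sign.
mangled = utf8_bytes.decode("cp1252")

# A lone 0xA3 byte is not valid UTF-8, so a strict UTF-8 reader rejects it.
try:
    cp1252_bytes.decode("utf-8")
    single_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    single_byte_is_valid_utf8 = False
```

So a file saved with the single-byte options is not UTF-8 at all, and a UTF-8 file opened with a single-byte assumption shows an extra character before the "£".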

So in general Excel fared a bit better than what you got @nacnudus - I guess Excel has improved a bit between 2010 and 2016. Now that Excel has the new way to save UTF-8 with a BOM, it is working sensibly with a UTF-8 and a BOM. However Excel still doesn't read UTF-8 CSVs without a BOM, unless you use the Text Import Wizard. So @nacnudus I reckon this somewhat undermines the 'Excel is rubbish at UTF-8, whatever we do' argument.

@frankieroberto excellent find that grep is now fine about BOMs. I've tried sort, wc and cat (on Mac) and all just skip the BOM fine.

Latest summary

For BOM:

Excel is not great with UTF8 generally. Without a BOM, you need to use Text Import Wizard, otherwise non-ASCII characters get mangled.

A bit against BOM (accepts it fine, but drops it on output):

Most unix tools: grep, cat, sort, wc
Command-line utils: csvkit, xsv
R & Python libraries - read, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.
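
The option referred to for Python's standard `csv` library is the `utf-8-sig` codec, which is applied when opening the file rather than in the `csv` module itself. A minimal sketch, using in-memory bytes instead of a real file (the sample data and the helper name `read_rows` are made up):

```python
import csv
import io

# UTF-8 with a BOM, then a header row and one data row containing "£".
data = b"\xef\xbb\xbfname,price\r\ntea,\xc2\xa32\r\n"


def read_rows(raw_bytes, encoding):
    """Decode raw_bytes with the given codec and parse it as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    return list(csv.reader(text))


plain = read_rows(data, "utf-8")      # BOM leaks into the first header cell
sig = read_rows(data, "utf-8-sig")    # BOM is consumed by the codec

# Writing with "utf-8-sig" puts the BOM back; plain "utf-8" would not.
out = io.BytesIO()
wrapper = io.TextIOWrapper(out, encoding="utf-8-sig", newline="")
csv.writer(wrapper).writerows(sig)
wrapper.flush()
with_bom = out.getvalue()
```

`utf-8-sig` also reads files that have no BOM, so it is a safe choice for reading when you do not know which kind of file you will get.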

arnau commented 5 years ago

@davidread you said:

[...] cat (on Mac) and all just skip the BOM fine.

Out of curiosity, what did you do that works fine? My test looks like this:

$ curl -o utf8_bom.csv https://csv-encoding-test.herokuapp.com/csv/is-utf8-with-bom/as-utf8
$ echo "" >> utf8_bom.csv
$ cat utf8_bom.csv utf8_bom.csv > mix.csv

Results in a broken utf-8 file:

$ bat -p mix.csv
<U+FEFF>CSV, £
<U+FEFF>CSV, £

Using MacOS 10.13.6, curl 7.54.0, default cat, default echo and bat to clearly show BOM.
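
The same effect can be reproduced without the shell. This Python sketch mimics the byte-wise concatenation that cat does and shows one possible way to join the files without leaving a BOM in the middle; `concat_without_boms` is a hypothetical helper, not something any of the tools above provide.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF
part = BOM + "CSV, £\n".encode("utf-8")

# Byte-wise concatenation, which is what cat does.
naive = part + part

# Decoding with utf-8-sig strips only the leading BOM, so one U+FEFF remains.
decoded = naive.decode("utf-8-sig")
stray_boms = decoded.count("\ufeff")


def concat_without_boms(chunks):
    """Join UTF-8 byte strings, dropping a leading BOM from each one."""
    return b"".join(c[len(BOM):] if c.startswith(BOM) else c for c in chunks)


clean = concat_without_boms([part, part])
```

Here `stray_boms` is 1, matching the second `<U+FEFF>` in the bat output above.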

Also, is anyone testing Excel on Windows? I don't have access to it.

gheye commented 5 years ago

Hi,

I am working as a data architect at GDS looking at Data Standards.

I would like to lend my support to the introduction of this standard.

My one comment is:

I would like to see it extended, in government, so there should always be a header row at the top of a file.

I understand for legacy applications and flows this may not be ideal at the moment.

Best Wishes,

Gareth H.

rufuspollock commented 5 years ago

Just to say i'm loving this thread and the detail re stuff like BOM - it is really useful.

Also to understand, is this just about standardizing on a "CSV Spec" and, if so, is the Frictionless Data CSV Dialect spec https://frictionlessdata.io/specs/csv-dialect/ relevant here (even if just informative)? (I'm also wondering whether CSV dialect needs extending with a BOM field.)

davidread commented 5 years ago

@gheye Great! Yes I'm keen to encourage a header row. Very much a user need.

@ldodds mentions that CSVW takes a view on CSV best practices which mostly echoes comments here:

start with RFC4180 - we agree
also allow unix line endings - this would suit many of the techies and their default tools, but they are a fraction of the users, compared to the Excel users. I'm not clear yet whether Open Standards Board would allow guidance that contravenes the RFC, so I'm tempted to be agnostic for now.
default to UTF8 but allow other encodings - declaring a different encoding in the HTTP header only works if the CSV is used in-place "on the web" - rather incompatible with the average user that downloads it before loading it into Excel etc. (although I love the CSVW vision). Gov has agreed to adopt UTF8 only, so this isn't really an issue.

@arnau Ah, you're dead right - thanks for this. I was simply going by what was echoed to the terminal, but yes when you dump the output it's clear the BOM is not dropped. 'cat', 'grep' & 'sort' are being dumb about handling the BOM, treating it as part of the first cell. So basic unix tools don't like BOMs.

Latest summary

For BOM:

Excel is not great with UTF8 generally. Without a BOM, you need to use Text Import Wizard, otherwise non-ASCII characters get mangled.

A bit against BOM (aware of it, but has friction working with it):

Command-line utils: csvkit, xsv
R & Python libraries - read, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.

Against BOM (not aware, lots of friction):

Most unix tools: grep, cat, sort, wc

frankieroberto commented 5 years ago

@davidread in the 'For BOM' section, can you add something like "Allows web browsers to use the correct encoding when it’s not specified in the HTTP header"?

davidread commented 5 years ago

I think [web browsers are] an important use-case, as lots of people like to preview a CSV file before downloading it (to check what’s in it, or that they’ve found the right one)

@frankieroberto I definitely agree with the user need to preview the CSV content. However I'm intrigued by you talking about displaying the file in the browser, because my experience is that when you click on a CSV link, the browser invites you to save it to disk and maybe open in Excel (depending on file associations), rather than displaying it in the browser. That's trying it with Chrome & FF on GOV.UK, data.gov.uk and Racial Facts and Figures, for example. When do you/users see it display in the browser?

gheye commented 5 years ago

Hi all,

I think that we should ideally get RFC4180 accepted as a baseline standard and then raise an issue to extend it.

We need a baseline first.

Perhaps get it accepted with known issues that require discussion/agreement.

Thanks.

Gareth H.

frankieroberto commented 5 years ago

@davidread Chrome and Firefox both seem to download all text/csv files, but Safari lets you view them in the browser (e.g. this one, which I found via data.gov.uk).

For https://csv-encoding-test.herokuapp.com I cheated slightly (for the sake of testing) by setting the content-type to text/plain, which then does render in all browsers (and they all seem to use the BOM, in the absence of an explicit encoding set within the content-type header).

So yeah, in short, it only really affects Safari. And probably there are better ways to preview CSV files.

arnau commented 5 years ago

@davidread did we get to any conclusion regarding BOM? Is there anything else we can help with to move forward this proposal?

davidread commented 5 years ago

Thanks @arnau. @frankieroberto I've added the Safari point and the latest summary is:

For BOM:

Excel is not great with UTF8 generally. Without a BOM, non-ASCII characters get mangled. (This can be overcome by using Text Import Wizard.)
Safari uses BOM to display CSVs when the header doesn't set the encoding

A bit against BOM (aware of it, but has friction working with it):

Command-line utils: csvkit, xsv
R & Python libraries - read, readr & pandas are frictionless reading a BOM, but saving requires an option. And Python's standard 'csv' library does need an option to read a BOM correctly.

Against BOM (not aware, lots of friction):

Most unix tools: grep, cat, sort, wc

Thinking about user needs, I think we should discourage BOM use. This will minimize friction with the command-line tools and programming languages, most used by analysts, statisticians, data scientists etc. We choose this with the awareness that it is not ideal for Excel users, who will find any non-ASCII characters (e.g. "£") appear incorrectly, but this does not get in the way of them doing basic analysis characteristic of Excel users - calculate a total, draw a chart, filter rows etc. And where it is an issue they can use the 'Import Text' option instead.

I'm taking this to the board tomorrow. I'm very happy to debate this BOM conclusion further, in terms of the user needs. However I think the BOM decision is a small thing in the scheme of things.

davidread commented 4 years ago

I'm pleased to report that this has been accepted by the Open Standards Board! Thanks to everyone for contributing their experience and considered opinions to this discussion, and for supporting this.

It will be confirmed by the Board but my understanding from the meeting is:

The board recommends RFC 4180 for publishing machine-readable tabular data. It will advise that:

Character encoding - ASCII & UTF-8 are the existing standards for character encoding
Line endings - LF (Unix-style) are also acceptable
Header rows - one header row is recommended (not zero)
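
As a sketch of what publishing to this profile could look like in practice, here is a small Python example using the standard `csv` module. The helper name `to_rfc4180_bytes` and the sample rows are made up; the sketch uses CRLF line endings as in RFC 4180, one header row, UTF-8 and no BOM.

```python
import csv
import io


def to_rfc4180_bytes(header, rows):
    """Serialise rows as UTF-8 CSV with one header row and CRLF line endings."""
    buffer = io.StringIO(newline="")
    # QUOTE_MINIMAL quotes only fields containing commas, quotes or line breaks,
    # and embedded double quotes are escaped by doubling them.
    writer = csv.writer(buffer, lineterminator="\r\n", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")  # plain "utf-8" adds no BOM


payload = to_rfc4180_bytes(
    ["department", "note", "spend"],
    [["Defra", 'said "ok"', "£1,200"], ["MoJ", "", "300"]],
)
```

Passing `lineterminator="\n"` instead would give the Unix-style line endings that the advice also accepts.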

I didn't push for discussion of allowing Byte Order Mark (BOM), following discussion of compatibility and user needs on this issue. So because it is not part of RFC4180 I conclude it is not part of the Open Standards Board recommendation.

The board decided this should be a "recommendation", rather than "mandatory", because the use of tabular data is a huge area and deciding where to draw the line, and enforcing it, is not a realistic task for the Open Standards Board.

The basic message is: CSV is a good choice for tabular data. And if you're publishing a CSV, publish it according to RFC4180 by default.

nacnudus commented 4 years ago

Fantastic news. Thank you @davidread for writing the proposal and marshalling it through to acceptance. Here's to clean, standard tabular data!

andyjpb commented 4 years ago

Thanks @davidread! It's a great piece of work with important implications for the data ecosystem both inside and outside Government. I'm really glad you got it through and I know that you've been working hard on it for a long time.

Lawrence-G commented 4 years ago

The standard profile document is now published on GOV.UK

https://www.gov.uk/government/publications/recommended-open-standards-for-government/tabular-data-standard

frankieroberto commented 4 years ago

@Lawrence-G what's the difference between https://www.gov.uk/government/publications/recommended-open-standards-for-government and https://www.gov.uk/government/publications/open-standards-for-government?

Lawrence-G commented 4 years ago

@frankieroberto The first is for standards recommended for use and the other is for compulsory standards. The two categories have existed in theory in the past, but until this year all were mandated by the OSB (including some where the proposal was to recommend), so we have had to create a second page to list them in. We are going to rework the standards list soon, as we feel two pages of simple lists are probably not the best solution.

frankieroberto commented 4 years ago

@Lawrence-G ah, I see. So RFC 4180 is only recommended but not mandated?

Merging the two lists makes sense to me, but in the short term, is it worth linking to the recommended ones from the collection page: https://www.gov.uk/government/collections/open-standards-for-government-data-and-technology (unless it is already and I've missed it) – that's where I'd normally look for existing standards when developing a service.

davidread commented 4 years ago

@frankieroberto Yeah, the board decided this should be a "recommendation", rather than "mandatory", because the use of tabular data is a huge area. I made an attempt to define the scope of "use cases where CSV is best" in the proposal, and came across plenty of exceptions pretty quickly. Deciding where to draw that line between CSV and Excel/ODS/JSON/Parquet/Arrow/NetCDF etc., and enforcing it, is not a realistic task for the Open Standards Board. So the message is "CSV is a useful 'lowest common denominator' / general purpose tabular format" and "when you publish CSV, make sure it is RFC4180"

co-cddo / open-standards

Tabular data #58
