Tabular data - Githubissues

davidread commented 6 years ago

Category: Data

Challenge Owner

David Read, tech arch at MoJ's Analytical Platform. Background with GDS on: data.gov.uk, Better Use of Data team.

Short Description

Tabular data (e.g. CSV) is the most common data format but it is loosely defined and users would benefit from standardizing on the details.

This challenge is not about Excel workbooks or similar. It is about data that is primarily consumed by machines/software, rather than humans.

This challenge is not about metadata (e.g. schema / column types, licence) or validation. That's covered in challenge #40 and the options, including CSV on the Web and Tabular Data Package are both about putting metadata in a separate file, so is a separate conversation.

User Need

Off the top of my head:

As a data scientist I want to open the file directly in Python or R into a tabular data structure (e.g. DataFrame) without having to wrangle it (frictionless access to data) so that I can efficiently analyse it
As a developer I want to do simple processing of the data in a range of simple software (e.g. command-line bash, Javascript, Java)
As a citizen with low data literacy I want to open the file directly into Excel to view the table so that I can browse government data (secondary use case)

Expected Benefits

We want to encourage government users and citizens to use government data more, for greater understanding and decision-making. There are plenty of barriers to this, including skills, tools, access, licencing etc but one small but significant one is a proliferation of usage of CSV. These often require users to do extra work::

configure 'dialect', such as quote character, escape character, line ending
collapse multiple header rows into one, or create a missing one
character encoding conversion (happily this is covered by an existing government open standard)

Examples of bad tabular data:

ONS generally put metadata in the CSV e.g. https://www.ons.gov.uk/generator?format=csv&uri=/economy/grossdomesticproductgdp/timeseries/l8gg/qna
Defra example where lots of tables are into a single CSV e.g. https://data.gov.uk/dataset/f0f1a7b9-4a56-48c9-b0b3-99482e7d6980/basic-horticultural-statistics
Cabinet Office example where you'd have to combine 2 rows to get the headers https://data.gov.uk/dataset/14992eea-4f20-479d-8314-d5b08f1a9b9f/cabinet-office-annual-report-and-accounts-2012-13/datafile/803467f0-b2cc-4bde-9fc7-8b14646f8774/preview

Functional Needs

The functional needs that the proposal must address.

edent commented 6 years ago

Thanks for this David. Let me provide some background which might help with this challenge.

The Open Standards Board does not create standards. We look at what existing standards best meet our users' needs. This means that no one ever has to develop a interoperability library with a UK Gov specific standard. In this case, we would look to see what CSV standards there are, and seek to agree on one.

At the moment, CSV is covered by our UTF-8 standard. That is, if it is a plain text file, written in Unicode, encoded with UTF-8, it meets our definition of "open".

We had previously looked at RFC4180 - but it does present some problems.

Requires CR/LF as line endings. Mac and Linux tend to produce just LF endings.
It specifies ASCII - although notes other encodings could be used.
It is not a living standard. The only updates have been around meta-data.

So, my question to you is - which bodies are working on CSV standardisation?

davidread commented 6 years ago

ODI have written guidelines for a tightly specified CSV to accompany their CSV Lint checker tool: https://csvlint.io/about#how_write_csv No doubt there are others - let's here more suggestions.

edent commented 6 years ago

Interesting looking at the results on a couple of CSVs https://csvlint.io/package/5ad5c19193ede3000900001d

If anyone knows of other validators and/or standards, we would be very interested.

davidread commented 6 years ago

Tool-wise, there's also the OKI Good Tables tool: http://goodtables.okfnlabs.org/ which seems pretty similar, but in python rather than ruby.

@edent I get that the Open Standards Board doesn't want to get into the process of developing a standard. But there would be some merit in agreeing standards amongst gov. Maybe we should instead aim to get something into the Tech Code of Practice?

edent commented 6 years ago

To clarify, we're happy to agree on an existing standard. But we won't create a new standard specifically for Government.

MikeThacker1 commented 6 years ago

CSV on the web http://w3c.github.io/csvw/ is the obvious candidate - validated via CSV Lint. CSV schemas are defined in JSON format.

The LGA's CSV validator http://validator.opendata.esd.org.uk/ uses schemas themselves defined in CSV (downloadable from that address but not documented in detail).

This https://standards.porism.com/doc.html?#/tools?id=spreadsheet-validation might help.

Mike

davidread commented 6 years ago

@MikeThacker1 Maybe I've missed something here, but CSVW defines a schema file that is a partner for the CSV file. This challenge is about tightening up the CSV file itself.

MikeThacker1 commented 6 years ago

Sorry David. You're right. My mistake.

On 17 April 2018 at 12:23, David Read notifications@github.com wrote:

@MikeThacker1 https://github.com/MikeThacker1 Maybe I've missed something here, but CSVW defines a schema file that is a partner for the CSV file. This challenge is about tightening up the CSV file itself.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alphagov/open-standards/issues/58#issuecomment-381955362, or mute the thread https://github.com/notifications/unsubscribe-auth/AKVSnYUPWQk5qMWVUBgTa3B_XNxZi-VWks5tpdDBgaJpZM4TX7Gr .

porism.com

edent commented 6 years ago

@davidread Do you know how many CSVs hosted on GOV.UK don't meet your use cases?

I'm trying to understand the scale of the problem.

I've picked a random sample, and they all opened fine in LibreOffice and were read by Python/Pandas.

Do you have some examples of files which don't? Thanks.

davidread commented 6 years ago

I've extended "Examples of bad tabular data" to include some examples that we came across today.

I don't have numbers. GOV.UK doesn't put their downloads in their API - you'd need to scrape the HTML (ugh). data.gov.uk do have an API. I've simply not done the analysis.

The ONS example above is what sparked me to write this, and I think it is true for all their thousand-odd datasets on their newish site. A significant proportion of statistical analyses use ONS's data that discounts for inflation, for example.

Traditionally a stastician would open this data in Excel and copy/paste them into their calculations sheet. We're trying to persuade them to change the way stats are created to be a reproducible R/Python script, where the first line downloads the latest data from the source and its dropped into a DataFrame.

Sure, R & Pandas have lots of auto-magic and options to deal with most of these annoyances, but not all. But those are the best tools, and require tinkering to get it right every time. It's unnecessary friction. It discourages casual & exploratory use, which is at the root of all data work.

It seems such a simple thing to agree what makes a "good-enough" CSV and it would be a small but useful push in the right direction.

nacnudus commented 6 years ago

Another angle to take would be to add an automatic csvlint step when publishing CSVs to alert publishers to any problems.

edent commented 6 years ago

There are three distinct problems here

CSVs which are corrupt.
CSVs which have poor structure / data.
CSVs which contain data useful to a person reading the file, but annoying for automatic machine readability.

The first I agree could be solved by running a linter on upload to check for UTF-8, proper escapes, line endings, and similar.

The second can only be solved by contacting the producer of data and explaining the problem.

As for the third... I think this will be a difficult issue to solve satisfactorily. Many of the users of the datasets are happy to open in a spreadsheet app, then manually manipulate the data. The metadata will be actively helpful for them.

This could be solved by using CSV on the Web - but it would require a massive change to the way publishers work. I suspect most of them design a complex spreadsheet and then just hit "Save As CSV".

We have three users - the producer, the consumer, the computer. Making a mandatory change to "no metadata" CSVs places a burden on the producer and consumer for the benefit of automatic tools.

So, actions for all of us:

We should agree on an acceptable open-source CSV validator, see how many current CSVs would fail, and see if it can be integrated into the publishing process. @stevenjmesser to look into if this can be added to the publishing workflow.
@DavidRead can you contact some of the departments to explain the issue and see whether they can produce either suitable CSVs or a JSON API for you to connect to?
Let's gather some examples of CSVs published in 2018 which fail to meet our expectations so we can understand the range of problems.

davidread commented 6 years ago

@nacnudus I like it - this seems like a useful step, although the user would have to be amenable. I feel it really needs to be backed up by having a standard for two other groups of data publishers:

to push any heel-dragging data publishers up to an acceptable level
to see if we can find consensus, by pooling bodies of expert opinion and user research from the likes of ONS, because we currently have different views

davidread commented 6 years ago

@edent I like your suggested actions and I'll look to get some more recent evidence and see if I can bring into the discussion some departments with differeing views.

rufuspollock commented 6 years ago

Frictionless Data specs are directly relevant here:

Table Schema: http://frictionlessdata.io/specs/table-schema/
Tabular Data Package (CSV + Table Schema): https://frictionlessdata.io/specs/tabular-data-package

There's a lot of tooling support for these with libs in lots of languages and a very full featured validator library and service http://github.com/frictionlessdata/goodtables-py & https://goodtables.io/ (http://goodtables.okfnlabs.org/).

I note goodtables includes both structural checking (blank rows, columns etc) and validation against table schema.

We also have extensive tools for inferring schemas from CSVs e.g. infer method in https://github.com/frictionlessdata/datapackage-py and https://github.com/frictionlessdata/datapackage-js

I'd also comment that Table Schema and Tabular Data Package were original inspiration for W3C work (and I was an editor / contributor on both specs) but Table Schema / Tabular Data Package are a lot simpler specs relatively which is reflected in the breadth of implementation support (e.g. table schema lib in multiple languages, table schema support in pandas etc).

davidread commented 6 years ago

@rufuspollock Let's not get distracted by schemas or tooling here - that's a big world and is covered by other challenges.

The focus of this challenge is simply to agree on a CSV standard that makes it easy to load using stock CSV libraries everywhere.

danbri commented 6 years ago

As a data scientist I want to open the file directly in Python or R into a tabular data structure (e.g. DataFrame) without having to wrangle it (frictionless access to data) so that I can efficiently analyse it

If [our hypothetical user] is going to call themselves a scientist, they ought to feel at least some obligation to read the footnotes, caveats, supporting documentation, and think about the less obvious characteristics of the data. That may sound harsh and I don't intend it meanly, but this is currently a huge problem with scaling up the modern machine learning scene to engage with more chaotic real world datasets.

davidread commented 6 years ago

@danbri I agree that supporting documentation should be available alongside the data, but that's not what I meant this to be about. I can see that "frictionless access to data" is suggestive of how it might be downloaded, but I mean it in terms of going from a CSV file on your local disk to a dataframe or equivalent data structure in common languages.

nacnudus commented 6 years ago

I agree with @davidread, that this proposal is about the fact that common tabular data formats don't behave as well as, say, JSON, so we should standardise on one that does behave well.

edent commented 6 years ago

@davidread @nacnudus can you suggest a good standard that we could look at?

davidread commented 6 years ago

So, what about RFC4180?

I previously suggested the ODI CSV guidelines but actually RFC4180 (which you suggested) covers most of these, such as each row being a record, requiring the same number of fields for every row, not allowing multiple header rows, not allowing blank rows.

My one disagrement with RFC4180 is that the header row is optional - I had an annoyance with a header-less meterological dataset before. But I don't see this as a big deal.

I'm actually ok with the Windows line endings, because that has the huge advantage that it is compatible with Excel over many years. Excel is the widest used piece of software in the data space, and only outputs Windows line endings. Arguably we cause a bigger issue for Excel users if we select mac/unix line endings than vice-versa. So Windows line endings seem best at meeting user needs to me, in this case. (The ODI guidelines agree too.) Happy to hear any other thoughts and reasoning.

RFC4180 allows the gov's standard of UTF8, so there seems no issue there.

You say RFC4180 is not a living standard, and IETF describe it as 'informational' rather than a standard. But I notice they did publish an errata only last month. It's not changed much, but maybe it doesn't need to. And actually one of its advantages is that it's simple and has been around for decades.

The problems I've been talking about with CSVs are abuses compared to RFC4180. If we can agree on and bless this one then it could be a useful push to publishers to fall into line and make life easier for government data users.

nacnudus commented 6 years ago

Here's a recent paper that analyses a random sample of 80 CSV files from data.gov.uk, considers RFC 4180 and some state-of-the-art CSV parsers in R and Python.

I would summarise the paper, but the whole thing seems relevant.

edent commented 6 years ago

OK. Let's get started with this challenge. I suggest that this be specifically for datasets published as "Machine Readable Data". That is, if users want to publish data as CSV and it is intended to be consumed by humans, that's fine.

@Lawrence-G please can you start the assessment of https://tools.ietf.org/html/rfc4180

@davidread please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

rufuspollock commented 6 years ago

@davidread

The focus of this challenge is simply to agree on a CSV standard that makes it easy to load using stock CSV libraries everywhere.

OK that's clearer to me. I'd suggest you'd still want to look at http://frictionlessdata.io/specs/csv-dialect/ because it will allow you describe / validate a bunch more CSV. Many CSVs in the wild don't limit themselves to compliance with https://tools.ietf.org/html/rfc4180 i.e. they don't always use "," as the separator, they have different line endings etc.

@davidread @edent

@davidread please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

i would also mention that goodtables does have a structure (not schema) validator that helps you pick up a lot of common CSV structure bugs e.g. missing headers, blank rows, missing columns etc https://github.com/frictionlessdata/goodtables-py

Lawrence-G commented 5 years ago

Our assement of RFC 4180 using the CAMSS based list of questions

Formal specification

Q. 1. Does it address and aid interoperability between public administrations?

A. Yes RFC 4180 should allow data in comma separated value (CSV) file format to be shared, processed and analysed more easily.

Q. 2. Does it address and aid the development of digital services in government?

A. Yes Systems that use data should be easier to develop for tabular data that is published in a simple machine readable format.

Q. 3. Are the functional and non-functional requirements for the use and implementation of the specification clearly defined?

A. Yes There are few requirements for this standard and they are well documented.

Q. 4. Is it possible to implement the specification across different domains?

A. Yes Any domain where data is stored and exchanged as text in comma separated value (CSV) format.

Q. 5. Is it largely independent from products of single providers, either open source or proprietary?

A. Yes This standard is Independent of any supplier.

Q. 6. Is it largely independent from specific platforms?

A. Yes Applies to text files in CSV format and is applicable to all platforms

Q. 7. Has the standard been written so that it can be delivered or used with more than one technology (for example XML and JSON)?

A. No This standard only applies to CSV text files

Q. 8. Has the specification been sufficiently developed and existed long enough to overcome most of its initial problems?

A. Yes Originally drafted in 2003

Q. 9. Are there existing or planned mechanisms to assess its conformity and implementation - for example conformity tests, certifications and plugfests?

A Yes. There are several tools that can be used on and offline. Perhaps the best known is CSV lint - https://csvlint.io/ from the ODI

Q. 10. Does it have sufficient detail, consistency and completeness for the use and development of products?

A. Yes Though scope is limited to tabular data

Implementation of the formal specification

Q. 11. Does it provide current implementation guidelines and documentation for the implementation of products?

A. Yes The documentation has notes for implementation.

Q. 12. Does it provide a reference (or open source) implementation?

A. No Not required as this is a simple standard to implement. There are examples available online that conform to the standard

Q. 13. Does it address backwards compatibility with previous versions?

A. Not Applicable There are no previous versions of RFC 4180

Q. 14. Are the underlying technologies for implementing it proven, stable and clearly defined?

A. Yes Underlying technologies are encoded text arranged in tabular form e.g UTF-8

Openness

Q. 15. Is information on the terms and policies for the establishment and operation of the standardisation organisation publicly available?

A. Yes The information is available at http://www.ietf.org/about/

Q. 16. Is participation in the creation process of the formal specification open to all relevant stakeholders (such as organisations, companies or individuals)?

A. Yes IETF have a formal review and approval process. See - https://www.ietf.org/about/participate/

Q. 17. Is information on the standardisation process publicly available?

A. Yes See https://www.ietf.org/standards/process/

Q. 18. Is information on the decision-making process for approving formal specifications is publicly available?

A. Yes See https://www.ietf.org/standards/process/role-iesg-standards-process/

Q. 19. Are the formal specifications approved in a decision-making process which aims at reaching consensus?

A. Yes “Any action made by an Area Director or the IESG may by made the subject of the conflict resolution mechanisms set out in RFC 2026, the primary aim of which is to resolve conflicts and move the IETF as a whole towards consensus.” see https://www.ietf.org/standards/process/appeals/

Q. 20. Are the formal specifications reviewed using a formal review process with all relevant external stakeholders (such as public consultation)?

A. Yes Explained in on this page https://www.ietf.org/standards/process/informal/

Q. 21. Can all relevant stakeholders formally appeal or raise objections to the development and approval of formal specifications?

A. Yes See the process here - https://datatracker.ietf.org/doc/rfc2026/

Q. 22. Is relevant documentation of the development and approval process of formal specifications publicly available (such as preliminary results and committee meeting notes)?

A. Yes IETF drafts are published publicly at http://www.ietf.org/id-info/ Discussions take place online in archived mailing lists

Access to the formal specification

Q. 23. Is the documentation publicly available for implementation and use at zero or low cost?

A. Yes RFC 4180 is free to view and download

Q. 24. Is the documentation of the intellectual property rights publicly available (is there a clear and complete set of licence terms)?

A. Yes See here - https://www.rfc-editor.org/info/bcp79

Q. 25. Is it licensed on a royalty-free basis?

A. Yes There are no royalties payable for use of RFC 4180

Versatility/flexibility of the proposed standard

Q. 26. Has the formal specification been used for different implementations by different vendors/suppliers?

A. Yes It is difficult to be sure but it is probable.

Q. 27. Has the formal specification been used in different industries, business sectors or functions?

A. Yes The standard is widely used where tabular data is stored and published

Q. 28. Has interoperability been demonstrated across different implementations by different vendors/suppliers?

A. Yes CSV format aids interoperability across products from different vendors e.g Excel, Google Sheets . This standard’s aim is to improve on this.

End user effect of the formal specification

Q. 29. Do the products that implement it have a significant market share of adoption?

A. Yes Popular spreadsheet and database programs are able to support RFC 4180

Q. 30. Do the products that implement it target a broad spectrum of end-uses?

A. Yes Use in all sectors, including government, industry and academia.

Q. 31. Does it have strong support from different interest groups?

A. Yes Including the ODI and The National Archive.

Q. 32. Is there evidence that the adoption of it supports improving efficiency and effectiveness of organisational process?

A. Yes It should do but difficult to find evidence for.

Q. 33. Is there evidence that the adoption of it makes it easier to migrate between different solutions from different providers?

A. Yes It should do but difficult to find evidence for

Q. 34. Is there evidence that the adoption of it positively impacts the environment?

A. No It should have no impact on the environment

Q. 35. Is there evidence that the adoption of it positively impacts financial costs?

A. Yes RFC 4180 will save time spent wrangling data by data scientists

Q. 35. Is there evidence that the adoption of it positively impacts security?

A. No RFC 4180 should not have any impact on security. Note the security considerations in the RFC https://tools.ietf.org/html/rfc4180#page-5

Q. 37. Is there evidence that the adoption of it can be implemented alongside enterprise security technologies?

A. No Though RFC 4180 is not concerned with security there are no barriers expected in implementing alongside enterprise security technologies.

Q. 38. Is there evidence that the adoption of it positively impacts privacy?

A. Not Applicable No privacy implications

Q. 39. Is it largely compatible with related (not alternative) formal specifications in the same area of application?

A. Yes CSV files can usually be exchanged between systems despite differences in formatting though some manual transformation may be required

Q. 40. Is there evidence that the adoption of it positively impacts accessibility and inclusion?

A. Not Applicable No accessibility and inclusion implications

Maintenance of the formal specification

Q. 41. Does it have a defined maintenance organisation?

A. Yes This is an IETF standard

Q. 42. Does the maintenance organisation provide sufficient finance and resource to control short-to-medium-term threats?

A. Yes The organisation has a steady funding stream. IETF gets money from meeting fees and sponsorship, including from the Internet Society, which itself is funded through membership.

Q. 43. Does the maintenance organisation have a public statement on intention to transfer responsibility for maintenance of it, if the organisation were no longer able to continue?

A. No No public statement found after extensive web search. We assume, therefore, that none exists. In lieu of a public statement, this standards body has been in continuous existence since 1986

Q. 44. Does it have a defined maintenance and support process?

A. Yes The IETF https://www.ietf.org/ as defined by https://www.ietf.org/about/standards-process.html

Q. 45. Does it have a defined policy for version management?

A. Yes An RFC has a clear lineage, citing other RFCs which are obsoleted, or which have supersede it.

Related European standards

Q. 46. Is this an existing European standard or an identified technical specification in Europe? (Note: CEN, CENELEC or ETSI are the European standards bodies. Technical specifications provided by organisations other than CEN, CENELEC or ETSI can be under consideration to become a European standard or an identified technical specification in Europe.)

A. No

Q. 47. Does this specification or standard cover an area different from those already identified or currently under consideration as an identified European standard or specification?

A. Yes

davidread commented 5 years ago

One thing that RFC4180 doesn't address is whether or not to have a BOM at the start of the file to indicate UTF8 encoding. However I don't think we can or should standardize to have a BOM or not. Leave it up to the publisher to decide. I just wanted to note it here with my reasoning, and in case anyone has further thoughts:

There's a lot of discussion e.g. this one. For me the key points are:

For BOM:

best compatibility with Excel (which is the widest used piece of software in the data space). If a CSV doesn't have a BOM then any non-ASCII characters will appear mangled in Excel, unless you use Excel's "Import Text" wizard and specify UTF8 manually.

Against BOM:

aside from Excel, the majority of software doesn't understand it, and you end up with the first column heading gets prepended with: "ï»¿". In Pandas you need v0.20.3+ and pd.read_csv(..., encoding='utf-8-sig') and in R do read.csv(..., fileEncoding = "UTF-8-BOM").
a BOM breaks ASCII compatibility, one of the advantages of UTF8.

So you have a choice between seamless use in Excel (with a BOM) or seamless use in everything else (without a bom). And I think this should remain a choice for people publishing CSVs in government.

stevenjmesser commented 5 years ago

please can you find someone from data.gov.uk and GOV.UK to help us understand whether a validator can be added to their publishing process?

Happy to help!

RobinL commented 5 years ago

Just wanted to add my support for RFC 4180. Speaking as an analyst, I think the important thing here is that there's a standard which is sensible. The specifics of the standard (within reason) are a bit less important than the fact I can go to senior staff, and say that a standard exists and we should follow it.

This then prevents publishers making simple mistakes in how open data is published (which sadly are all to frequent at the moment).

frankieroberto commented 5 years ago

@davidread do you have an update on this? eg is RFC 4180 likely to be adopted?

davidread commented 5 years ago

@frankieroberto tbh I've let this slide. But I'd very much like this to be taken forward. @Lawrence-G can you remind us what we need to do to move on this process?

Lawrence-G commented 5 years ago

HI, @davidread, If you are happy with the responses so far, the next step to produce a proposal for the adoption of RFC 4180 (and/or other standards). The proposal will be posted here and comments invited. Contact me directly if you'd like more information

davidread commented 5 years ago

I could do with some help writing this up, so please someone volunteer. I've made a start and dumped some of this existing material in a doc: https://docs.google.com/document/d/16cnjiMQ7KGsBlQVo9UsuRSL4xwnJkSIDR2Mal2B5vfA/edit# Email me (david.read@digital.justice.gov.uk) if you can help, and I'll give you write access. Or simply add comments.

RobinL commented 5 years ago

Dunno whether worth mentioning but ONS are starting to publish wcsv metadata data, eg here. This allows you to impose/validate the correct data types on data in csv format. I've not seen a parser that takes a csv and wcsv metadata to produce e.g. an R or a Python dataframe, but it would not be hard to do, and would be useful.

frankieroberto commented 5 years ago

@RobinL thanks – I think that's one of the options for #40? (Although there seem to be a few competing formats).

davidread commented 5 years ago

@RobinL you and others are right that tabular data can be nicely expressed by a pair of files - a CSV and a schema. It makes sense to debate whether that schema file should be CSVW, AWS Glue JSON, datapackage.json or any other sort, but I'm hoping that this is discussed in the sister Standards Challenge "Schemas for Tabular Data" #40.

Publishing schemas is an aspiration, but the lowest common denominator for tabular data is the focus of this Standard Challenge. Maybe it's clearer if I was to rename it "tabular data - minimum standard"? Arguably we should also mention it's only for machine-readable data too, else it looks like we're banning publishing spreadsheets. So I think we should just mention these caveats in the body of this standard.

So this Standard Challenge is about the CSV file itself. For those dozens of CSVs, TSVs, PSVs that government publishes every day. Rather than let every publisher choose a different file format and dialect, we can improve the users' lives by standardizing on one. We could benefit from agreeing that a CSV shouldn't spread its headers over multiple rows, doesn't have a total row at the bottom, and a "£" in a column header should only be expressed using UTF-8.

To push this on, if anyone has a chance to write about the Functional Needs section of the proposal, that'd be a help: https://docs.google.com/document/d/16cnjiMQ7KGsBlQVo9UsuRSL4xwnJkSIDR2Mal2B5vfA/edit#heading=h.5n8c56mms2vx

davidread commented 5 years ago

This paper describes the myriad technical difficulties of automatically parsing CSVs: https://hannes.muehleisen.org/ssdbm2017-muehleisen-csvs.pdf

It summarizes the problem like this: "Data scientists typically lose much time in importing and cleaning data, and large data repositories such as open government collections with tens of thousands of datasets remain under-exploited due to the high human cost of discovering, accessing and cleaning this data. CSV is the most commonly used data format in such repositories. The lack of explicit information on the CSV dialect, the table structure, and data types makes proper parsing tedious and error-prone."

The use cases I've provided so far are more about efficiencies for one-off interactions to data - e.g. a data scientists creating a data pipeline. This paper has prompted me to explore the use of and value of a large corpus of CSVs, if they can be automatically readable (i.e. if they follow a tabular data standard, as proposed). (Concerning principally government data, such as all the CSVs on data.gov.uk, GOV.UK & ONS) the value of a corpus would be:

ease of aggregating datasets spread across multiple CSVs. e.g. government transparency datasets (e.g. organizational spending, contracts, staff numbers, minister meetings, expenses) are published in monthly CSVs across hundreds of organizations. Analysis is curtailed by difficult parsing & aggregation (users: government itself, journalists, researchers, citizens etc)
to understand/appreciate government data as a whole, to understand and improve aspects across hundreds of organizations en masse, in terms of data best practices, improving data quality and getting value from data (policy areas: data driven organizations, data science, smart cities, transparency)
discoverability - a central browse / search, as an augmentation of metadata

maxwell8888 commented 5 years ago

@davidread that sounds very similar to the work the GSS data project is doing: https://gss.civilservice.gov.uk/guidance/the-gss-data-project/

nacnudus commented 5 years ago

@davidread your latest comment suggests a move away from the original challenge. Is that what you intend? The original challenge was more about standardising the file type so that it is easy to get it into memory with standard tools. Data quality, aggregation and search are far bigger problems that wouldn't be addressed by this standard, any more than the Open Document Format improves spreadsheets.

davidread commented 5 years ago

@davidread your latest comment suggests a move away from the original challenge. Is that what you intend?

@nacnadus Apologies if this misleads. The focus of this standard challenge is still on standardising the file type. I'm just talking about additional potential benefits if government data can be read into memory automatically en-masse. I agree that aggregating & searching data relies on more than just that, but reading it into memory is necessary. If the data format is not standardized it is a huge job to tinker with the parsing, to successfully read all the files.

davidread commented 5 years ago

@davidread that sounds very similar to the work the GSS data project is doing: https://gss.civilservice.gov.uk/guidance/the-gss-data-project/

@maxwell8888 thanks for that. It's good to see the project mentions "normalise the data into...Tidy Data", which appears to be a CSV, and of course we agree that is a very useful lowest common denominator.

However the main focus of that GSS project seems to be linking datasets into an n-dimensional RDF data cube, storing and publishing data using a commercial website, and providing live extracts of the data using APIs, all of which are up for debates but at another time and place. It suggests "CSV on the Web" as an output, but of course that is the schema / metadata - the standard has no view on the CSV file itself.

frankieroberto commented 5 years ago

Hi all. I referenced this standardisation from within this blog post: https://dataingovernment.blog.gov.uk/2019/05/02/a-discovery-into-data-publishing-formats/

I’d still love to see RFC4180 adopted as a baseline standard for CSVs, before we start thinking about metadata and schema standards. Is there any progress?

davidread commented 5 years ago

I've finished off a draft proposal for the Standards Board, which meets to discuss this in 3 weeks time. I'd love to get some more feedback on it, particularly from everyone here who've helped shape this, and particularly on this summary:

CSV is widely accepted as a general purpose open standard for machine-readable data. But when publishing CSV files, we’re all making different choices about how it is expressed - commas vs pipes, Windows vs Unix line endings etc. Let’s agree to follow RFC4180, which is pretty much what Excel exports CSVs as by default. And it reminds us that while a Excel sheet can be free-form, a CSV should be rigidly structured with a header row followed by record rows - don’t include metadata rows, or a row of totals at the bottom, for example. Adopting this will help users read the data into R/Python/etc for analysis and make better use of data that government publishes.

https://docs.google.com/document/d/16cnjiMQ7KGsBlQVo9UsuRSL4xwnJkSIDR2Mal2B5vfA/edit?ts=5cdac5f6&pli=1#

MikeThacker1 commented 5 years ago

Hello

I've not given this enough attention so far but am wholly supportive of it. Compliance would make reading Government data from multiple sources a lot easier.

I don't think the draft addresses representing multiple values within one cell. This is a common issue for us when we "squash down" one-to-many data into a spreadsheet format. For some time we / LGA / iStandUK has adopted the pipe as a means of delimiting multiple values in a single cell.

I'm not sure if it's worth making that part of this standard but thought I should mention it.

Mike

[image: Picture] Mike Thacker Porism

http://twitter.com/MikeThacker http://plus.google.com/+MikeThacker http://www.linkedin.com/company/porism

On Thu, 16 May 2019 at 12:25, David Read notifications@github.com wrote:

I've finished off a draft proposal for the Standards Board, which meets to discuss this in 3 weeks time. I'd love to get some more feedback on it, particularly from everyone here who've helped shape this, and particularly on this summary:

CSV is widely accepted as a general purpose open standard for machine-readable data. But when publishing CSV files, we’re all making different choices about how it is expressed - commas vs pipes, Windows vs Unix line endings etc. Let’s agree to follow RFC4180, which is pretty much what Excel exports CSVs as by default. And it reminds us that while a Excel sheet can be free-form, a CSV should be rigidly structured with a header row followed by record rows - don’t include metadata rows, or a row of totals at the bottom, for example. Adopting this will help users read the data into R/Python/etc for analysis and make better use of data that government publishes.

https://docs.google.com/document/d/16cnjiMQ7KGsBlQVo9UsuRSL4xwnJkSIDR2Mal2B5vfA/edit?ts=5cdac5f6&pli=1#

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alphagov/open-standards/issues/58?email_source=notifications&email_token=ACSVFHJDH3RYLCSCFFWPKETPVVAAVA5CNFSM4E27WGV2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVRQHEY#issuecomment-493028243, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSVFHLZUTYO73ENMXI3K7LPVVAAVANCNFSM4E27WGVQ .

G-Cloud and Digital Outcomes & Specialists supplier http://e-sd.org/dmmpJ/ of software, hosting and related services

nacnudus commented 5 years ago

Regarding multiple values within cells, I think the standard is right to omit that. A cell could contain any data that has been serialised to text, which could be delimited strings, JSON, HTML, SVG, binary, or anything else.

Often a better way to publish one-to-many data is in separate files with a common key, but I think any guidance about that kind of design decision should be separate from a technical standard about CSV files.

arnau commented 5 years ago

Hi, I'm a bit surprised on the proposal's position regarding the use of BOM. It says:

Byte Order Mark (BOM) - RFC 4180 does not specify it. However this government standard should encourage its presence. Excel on Windows includes a BOM when exporting as CSV in the presence of ASCII characters. And when Excel on Windows opens a CSV without a BOM, any UTF8 encoded characters are garbled (unless you resorting to the advanced import: Data> Get External Data> Import Text File). (Until the most recent version, Excel for Mac wouldn’t open UTF8 correctly by default, even with the BOM, so nothing we do is perfect). So in general it seems pragmatic, with Excel the most widely used data tool by a huge margin, we should encourage BOMs.

I was under the impression that BOM is more harmful than not. Have you done any study on the impact on forcing its presence beyond Excel?

An admittedly quick search on UTF-8 and BOM shows http://utf8everywhere.org/#faq.boms which says:

According to the Unicode Standard (v6.2, p.30): Use of a BOM is neither required nor recommended for UTF-8.

A read on the Unicode Standard suggests that BOM in UTF-8 has no effect in byte order (which is the purpose of a BOM say in UTF-16).

The utf8everywhere site also says:

Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.

Which begs the question of "Who is this standard primarily for?" and "What is the state of the world regarding tooling in terms of BOM".

If I understood the goal of this proposal, the primary user is a professional that needs to publish and consume CSV. For example, it suggests:

Fields containing line breaks - allowing these is problematic for simply tools like ‘grep’ or ‘wc -l’

Which is fair enough but makes me think in what is the situation with other common tools like cat and what are the implications of encouraging the use of BOM.

My position would be to discourage the use of BOM and help Excel for Windows users in a different way. I'm concerned with getting something not quite right in the aim of helping too many users. What do you think?

frankieroberto commented 5 years ago

@arnau good questions!

The problem with CSV files is that, as plain text files, there's nowhere that specifies the character encoding once you've downloaded them. When you view them in a browser over HTTP, then the HTTP header should in theory specify the character encoding as part of the MIME type (text/csv; charset=utf-8. However, in practice, very few servers are configured to do this.

The BOM was invented for UTF-16 files (where's it’s required), and serves no practical purpose in UTF8 files other than as an indicator of the character encoding.

Without the BOM, and with the character encoding either missing from the HTTP header (or missing because it's being loaded from a local filesystem), applications have to either use heuristics to guess at which encoding is being used, or ask the user (who may not know).

I've yet to discover any applications which process UTF8 CSV files but choke on the BOM – however they may exist? But there are plenty of applications, including Excel and most web browsers (when rendering CSV files directly) for which the presence of the BOM makes a difference.

I put together this test website a while back which you can use to check your browser (or application, if you download first): https://csv-encoding-test.herokuapp.com/

davidread commented 5 years ago

@frankieroberto Yes, Windows tools (Excel, PowerBI, Tableau) detect the BOM and handle it ok, but I really think BOMs do cause problems in the *nix & open source data analysis world, which is very important too. Even Python and R with their huge data science communities have lots of friction. By default you either get some awful error message, or you end up with the first column heading prepended with: "ï»¿".

Examples:

In Pandas you need v0.20.3+ and pd.read_csv(..., encoding='utf-8-sig')
In R do read.csv(..., fileEncoding = "UTF-8-BOM")
@arnau gave the example of using cat, and of course there's usually lots of mileage in other *nix tools: grep, sort, wc -l etc. I expect there is a command-line tool to strip the BOM, though.
With csvkit you need to add an encoding parameter csvsql -e utf-8-sig

In all these cases, it's hard work figuring out what the "ï»¿" is and how to get rid of it. Plenty of friction.

If you don't have a BOM, then Excel will display incorrectly anything non-ascii, like those "£" signs. But maybe that's not too bad? (Maybe in a non-English-speaking country we'd have a different thought.) If we have to choose between avoiding horrid error messages in the popular open source data analysis tools and the avoiding the occasional symbol looking wrong in Excel, I think overally the former is better. But I'm very happy to hear more evidence that swings the balance.

Current conclusion: users will be happier without BOMs

This choice would come with the advantage that it is what the RFC says. But it does mean that when saving CSVs from Windows Excel, and there are non-ASCII characters, you can't simply use "Save As". Another step will be needed to remove the BOM. (Someone should probably write a good .xls->.csv converter!)

nacnudus commented 5 years ago

Based on the experiments below, I think the proposal should recommend no BOM.

Excel itself doesn't create files with a BOM, so an Excel->CSV->Excel round trip will never work. (@davidread this is not what you seem to have found)
The vast majority of CSV files are created by Excel, so people would ignore any recommendation to use a BOM.
Every GOV.UK csv file in a sample of ~50 that had a £ symbol had mangled it, so BOM or no BOM nothing could have read it.

I tried R read.csv() and readr::read_csv() and Python pandas.read_csv() and they worked whether or not the BOM was present.

Excel 2010 (on Windows 10, using the File Open dialogue) couldn't read a CSV containing £ correctly unless it had the BOM, but using Excel's "save as MS Dos .csv" mangles the £ and then creates a file without a BOM. So the Excel->CSV->Excel round trip doesn't work.

I opened a CSV with a BOM in Excel, then "save-as" defaulted to "unicode". I think it turned out to be UTF-16 but at any rate it defeated everything except Excel itself and Vim -- nothing else could interpret it.

I tried about 50 CSV files sampled from GOV.UK, and every file that contained £ failed, which suggests that the vast majority of CSV files are created using Excel.

arnau commented 5 years ago

@frankieroberto thanks for putting together the test website, really useful.

@davidread @nacnudus thanks, these results make me more confident that we should recommend no BOM.

Also, given how difficult is to open or export CSV in Excel, perhaps it would be good to provide some guidance on how to do it? Probably better if it describes other tools/methods like OpenOffice, Google Sheets, etc?

Aside, when you say "Excel" would you mind specifying the program version and OS version? And, ideally, the way you opened the CSV (e.g. drag and drop, open file).

davidread commented 5 years ago

Thanks very much, all. The BOM question is a real can of worms! Here's my summary:

For BOM:

Excel is pretty bad with UTF8 generally, but the BOM does help it somewhat

Against BOM:

R & Python libraries can read a BOM, although for the frictionless examples @nacnudus gave I still suspect this requires recent versions, and doesn't apply to python's standard csv library.
*nix command-line tools do not like BOMs.
Desktop tools like Tableau

I guess it comes down to the Excel users, and whether we see them as primary users of this machine-readable data, or just secondary. The research we did on DGU showed that Excel is by far the most used tool. I'm not clear if this was just for viewing the file (which I'd argue is just a nice-to-have for machine-readable data), or led to actual basic data work by business analysts, financial sector, managers, etc. Clearly Python/R/*nix and other stuff is what is used by technical professions - analysts / statisticians / data scientists / developers.

@arnau and others, what's your experience of users and their needs for machine-readable data?

co-cddo / open-standards

Tabular data #58

Challenge Owner

Short Description

User Need

Expected Benefits

Examples of bad tabular data:

Functional Needs

Formal specification

Implementation of the formal specification

Openness

Access to the formal specification

Versatility/flexibility of the proposed standard

End user effect of the formal specification

Maintenance of the formal specification

Related European standards