alphagov / data-standards-authority

Collaboration space for working on data standards and guidance for the DSA
https://alphagov.github.io/data-standards-authority/

Request For Feedback on Additional CSV Guidance #1

Closed gheye closed 3 years ago

gheye commented 4 years ago

We have now published this guidance on GOV.UK - Using CSV file format

If you would like to provide further feedback, please feel free to do so using the feedback tools at the bottom of the published guidance or by contacting us on GitHub or by email at data-standards-authority@digital.cabinet-office.gov.uk


Hi,

We appreciate your input on this. The version below, which is a copy of this document, will be going to the DSA Steering Board for approval next week. We then hope this can be published on GOV.UK.

Please note the Google Doc is the master copy. I have taken care to ensure the text below matches it; however, human error is always possible.

——————-

Using CSV file format

Consistent and standards-based comma-separated values (CSV) files help make data more effective and easier to share across government.

The CSV file format is widely used in government for sharing and receiving data. This guidance is for government employees who publish data. It shares the standard and best practice for creating consistent and interoperable CSV files.

The government recommends the RFC 4180 open standard for CSV files. You can read how to use this standard in the Tabular data standard profile.

The standard aims to reduce common problems and help with data management. The standard and this guidance focus on the manual process and do not refer to using automated tools, libraries and languages. You can use some of the principles in this document in automated environments.

Use CSVs to meet user needs

When creating a CSV file you should consider the needs of government users, which include statisticians, data collection agencies, developers, data engineers creating automated extracts, and general users.

These users need a consistent CSV file format so they can:

You should also consider whether your CSV will be read by machines or by humans, and understand how this affects the accessibility needs of your file.

The main benefits of using CSV include the ability to:

When not to use CSV

It may not be appropriate to use the CSV file format if you:

It is important to note that exporting a spreadsheet to CSV format will:

Setting up your CSV file

Before you produce a CSV file, make sure the data is about one subject area. Your columns should relate to each other and not mix unrelated information. For example, you should not put train timetables and types of battery in the same CSV file.

When manually setting up a CSV file, or setting up an automated process, you should:

You should also apply metadata to your CSV files. You can learn more by reading the:

Managing large CSV files

What counts as a large CSV file depends on the department and the tool using the file. As a general guide, a CSV file of a few hundred MB is large for Excel, while a file of a few GB is large for an API or a data management tool. Keep this in mind when choosing which tools or applications to use.
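For example, a minimal sketch of processing a large CSV in chunks with pandas rather than loading it whole (the file name and chunk size are illustrative):

```python
import pandas as pd

# Read a large CSV in manageable chunks rather than loading it whole.
# "large_dataset.csv" and the chunk size are hypothetical examples.
row_count = 0
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    row_count += len(chunk)  # replace with real per-chunk processing

print(f"Processed {row_count} rows")
```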

When using large CSV files in Excel, you must make sure the whole file loads properly. If you are using older versions of Excel (97–2003), you should be aware that they have a low limit of 65,536 rows. Microsoft has information about Excel specifications and limits.

When CSV files become large, data suppliers might compress them into a ZIP file. If you use a ZIP file, you should add metadata to describe your CSV and include instructions on how to open the ZIP file without corrupting its contents.
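For illustration, a CSV inside a ZIP archive can be read without extracting it to disk first; this sketch uses only the Python standard library, and the file names are hypothetical:

```python
import csv
import io
import zipfile

# Open a CSV held inside a ZIP archive without extracting it first.
# "data.zip" and "data.csv" are hypothetical names.
with zipfile.ZipFile("data.zip") as archive:
    with archive.open("data.csv") as raw:
        # Decode explicitly as UTF-8, in line with the recommended encoding.
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
        header = next(reader)
        print(header)
```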

Validating your CSV files

To make sure your CSV files remain accurate and consistent, you should:
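For example, one basic structural check is that every row has the same number of fields as the header; a minimal sketch using Python's built-in csv module (the file name is hypothetical):

```python
import csv

# Flag any record whose field count differs from the header's.
# "report.csv" is a hypothetical file name.
with open("report.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"Row {line_number}: expected {len(header)} fields, got {len(row)}")
```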

Moving or storing data in CSV file format

When creating data sets, you should consider whether it is likely your data set will need to be transferred to or stored in different places. Using a standardised data structure makes moving data easier and helps keep it intact.

For example, the National Archives are legally required to store a range of data sets from government departments. It is easier for users to search and process these data sets if their format is standardised.

Working with multiple database tables

You should not compress multiple database tables into one CSV file. Instead, you could use a format such as CSV on the Web to express the relationships between several separate CSV files, or use a more appropriate file format.

There may be instances where two tables hold very similar data, such as first name and last name. In cases like this, it might make sense to merge them into one CSV.
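For illustration, a minimal CSVW metadata document linking two hypothetical files, customers.csv and orders.csv, through a foreign key might be generated like this (all file and column names are assumptions, not part of the guidance):

```python
import json

# A minimal CSVW metadata document describing two related CSV files.
# All file and column names are hypothetical.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [
        {
            "url": "customers.csv",
            "tableSchema": {
                "columns": [
                    {"name": "customer_id", "datatype": "string"},
                    {"name": "customer_name", "datatype": "string"},
                ],
                "primaryKey": "customer_id",
            },
        },
        {
            "url": "orders.csv",
            "tableSchema": {
                "columns": [
                    {"name": "order_id", "datatype": "string"},
                    {"name": "customer_id", "datatype": "string"},
                ],
                # Each order refers back to a row in customers.csv.
                "foreignKeys": [
                    {
                        "columnReference": "customer_id",
                        "reference": {
                            "resource": "customers.csv",
                            "columnReference": "customer_id",
                        },
                    }
                ],
            },
        },
    ],
}

with open("dataset-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```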

DavidUnderdown commented 4 years ago

An anti-recommendation for using Excel with CSV. It is still extremely hard to persuade it to read and write UTF-8 properly; it will tend to default to Windows-1252. With the default open action you have no control over the cell formatting that is applied, and dates will be silently changed to Excel dates in whatever your default format is (though ironically Excel does not recognise ISO 8601 formatted dates as dates). This has been fairly widely publicised since geneticists realised it was happening to gene symbols, causing problems in analysis. They opted to change gene symbols to things that Excel wouldn't think are dates earlier this year: https://www.theregister.com/2020/08/06/excel_gene_names/. I've also come across it changing a surname, True, into Boolean TRUE. It makes these changes silently.

LibreOffice Calc on the other hand allows you to set the text encoding on opening, and control the formatting of each column.

isichei commented 4 years ago

Although in the google doc there is a link to gov guidance on using ISO formats for dates/times, I think that should be more explicit. Also, if you are doing multiple format guides it might be worth having a tier list of when to use which data format (i.e. we try to avoid CSV internally as much as possible).

JKaranka commented 4 years ago

Regarding comment about Excel: sadly, tons of CSV files will be opened in Excel (not just Python, SAS, R, LibreOffice, etc.). Might be worth testing with Excel and letting users know in a readme or similar whether the CSV has been tested on it, and whether it would import correctly.

jl5000 commented 4 years ago

I wonder if we should also recommend that column headers should use underscores instead of spaces, avoid 'special' characters like '%', and should not encode missing values as anything other than an empty field (I know some CSV writing functions write 'NA' by default).

dcherrie-dstl commented 4 years ago

I wonder if we should also recommend that column headers should use underscores instead of spaces, avoid 'special' characters like '%', and should not encode missing values to anything else other than a missing value (I know some CSV writing functions write 'NA' by default).

A repetitive task when parsing CSV in Python is standardising header names; for me this involves...

  1. converting to "snake_case" (vs CamelCase) and
  2. dealing with special characters e.g. round brackets, ampersand etc.

For example "Orbital period (mins)" would become "orbital_period_mins" or perhaps "orbital_period"

Multiple strategies can be devised, but consistency is key.
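A minimal sketch of one such strategy, using only the Python standard library:

```python
import re

def normalise_header(header: str) -> str:
    """Convert a header such as "Orbital period (mins)" to snake_case."""
    header = header.strip().lower()
    # Collapse runs of spaces, brackets and other symbols into underscores.
    header = re.sub(r"[^a-z0-9]+", "_", header)
    return header.strip("_")

print(normalise_header("Orbital period (mins)"))  # orbital_period_mins
```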

gheye commented 4 years ago

Although in the google doc there is a link to gov guidance on using ISO formats for dates/times, I think that should be more explicit. Also, if you are doing multiple format guides it might be worth having a tier list of when to use which data format (i.e. we try to avoid CSV internally as much as possible).

Are you able to provide an example please? Thank you.

gheye commented 4 years ago

I wonder if we should also recommend that column headers should use underscores instead of spaces, avoid 'special' characters like '%', and should not encode missing values as anything other than an empty field (I know some CSV writing functions write 'NA' by default).

A repetitive task when parsing CSV in Python is standardising header names; for me this involves...

  1. converting to "snake_case" (vs CamelCase) and
  2. dealing with special characters e.g. round brackets, ampersand etc.

For example "Orbital period (mins)" would become "orbital_period_mins" or perhaps "orbital_period"

Multiple strategies can be devised, but consistency is key.

Thanks both. I wonder whether our guidance should try and nudge people towards camel case or using underscores. I will write something and add it in.

ben-sagar commented 4 years ago

This guidance says:

ideally save the file in UTF-8

The referenced Tabular data standard says:

RFC 4180 says you can use any encoding for non-ASCII characters, but the UK Government specifies UTF-8 for cross-platform character encoding.

It would be good to clarify whether this is mandated or optional.

ben-sagar commented 4 years ago

The Tabular data standard says:

RFC 4180 requires CRLF. You can also use “\n” (LF - Unix-style) for line breaks which is in line with W3C Best Practice.

I would welcome some clarification on this point in the guidance, or maybe an update to the data standard.

RFC 4180 allows any combination of CR and LF inside field values but mandates CRLF as a record separator.

This clarity makes the standard easier to test for in software and to identify potential errors in files.

A CSV that contains both LF and CRLF as record separators is more likely to indicate some sort of error in processing than be deliberate.
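As an aside, producers using Python's built-in csv module get RFC 4180 record separators by default, since csv.writer emits CRLF; a minimal sketch (the file name is hypothetical):

```python
import csv

# csv.writer uses "\r\n" as its default lineterminator, matching RFC 4180.
# Opening the file with newline="" stops Python translating line endings.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "Example"])
```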

davidread commented 4 years ago

The Tabular data standard says:

RFC 4180 requires CRLF. You can also use “\n” (LF - Unix-style) for line breaks which is in line with W3C Best Practice.

@bensagar-ea You're right, it's a bit ambiguous, and that's probably my fault. It's talking about line endings at the end of a record, rather than within a field.

davidread commented 4 years ago

There are some suggestions from @jl5000 and @dcherrie-dstl to only use simple column headers. This feels right, but to make a case for it, can you point to any specific examples of tools where spaces or symbols cause an issue?

I suppose Excel has problems with anything apart from 7-bit ASCII, because its default decoder is latin7 or something, rather than UTF-8. However it copes with spaces and % fine.

dcherrie-dstl commented 4 years ago

There are some suggestions from @jl5000 and @dcherrie-dstl to only use simple column headers. This feels right, but to make a case for it, can you point to any specific examples of tools where spaces or symbols cause an issue?

I suppose Excel has problems with anything apart from 7-bit ASCII, because its default decoder is latin7 or something, rather than UTF-8. However it copes with spaces and % fine.

A short and specific motivating example: https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd

davidread commented 4 years ago

As an aside, I wonder if there is a need for a tool that converts .XLS(X) files to CSV? It would do a better job than Excel's CSV exporter, for these reasons:

Secondly, a checker / linter for CSV would be good for flagging common errors and would promote the sort of guidance that is forming here. csvlint already has lots of good checks:

If these two tools were integrated into GOV.UK's publishing workflow (Whitehall), they might raise quality in CSVs en masse. My impression is that lots of transparency data, for example, arrives there through a number of manual processes, by people who are not data specialists and are not reading this sort of guidance. So tools that help them convert their Excel file to CSV, and nudge them to meet the guidance, could really help.

DavidUnderdown commented 4 years ago

We developed CSV Schema and CSV Validator at The National Archives (this started before CSV on the Web got under way and is for a slightly different use case).

gheye commented 4 years ago

Comments received from the Home Office:

ajtucker commented 4 years ago

Comments received from the Home Office:

  • Even though the standard allows it, I would make it a recommendation that data does not contain line feeds or carriage return characters, as this is a frequent source of broken data pipelines.

Not sure what you mean here? That the cells shouldn't have LF/CR in them? RFC 4180 does allow this, but I'd agree that it's not good practice.

One thing I was mildly surprised about is that CSV files (and all text/* media types) should use carriage return + line feed to end the line, as per the old DOS text standards, as opposed to Unix.

ajtucker commented 4 years ago

Secondly, a checker / linter for CSV would be good for flagging common errors and would promote the sort of guidance that is forming here. csvlint already has lots of good checks:

We (ONS) have been using csvlint successfully in pipelines where we use the standard CSV-W for validation. We've been using a slightly older forked version of the Ruby library for this along with JSON files to encapsulate the schema of the CSV we want to validate. This lets us:

The Ruby library is a bit slow and annoying to install. We've packaged it in Docker so that we can use it in some automated pipelines: https://github.com/GSS-Cogs/csvlint-docker

RickMoynihan commented 4 years ago

Hi all,

Disclosure: I don't work inside government, but I do work for Swirrl, a supplier of data publishing infrastructure for government. Through that work I have a longstanding collaboration with @ajtucker at ONS and others working with government data (mostly open).

Anyway, I thought I'd just pop in to provide a comment on the CSV encoding issues as I see them.

ideally save the file in UTF-8 which fully supports non-ASCII characters so the file remains accurate when you share it.

Firstly I fully agree that UTF-8 encoding is the most sensible option, and one that I would very much like to see mandated unless there is a good reason not to. In the case of an exception then the csvw recommendations are sensible:

CSV files should be encoded using UTF-8, and should be in Unicode Normal Form C as defined in [UAX15]. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1

However, as others like @DavidUnderdown have mentioned, the elephant in the room here is absolutely Excel. As I see it, the desire for CSV files to open properly in Excel is a divergent requirement that is largely irreconcilable with other open data goals, at least without presenting an Excel-specific option, or exposing users to the complexities of encoding and instructions on how to open the file (and even then I think there will be problems).

The problem is that Excel will mangle any UTF-8 CSV file it opens, via either File > Open or the import CSV process (even if a UTF-8 encoding is specified). The most accurate report into this behaviour I have seen is in the CSVW spec here, which might offer some leads to a resolution. In particular their observations around Excel's behaviour regarding UTF-16 are interesting; however, their investigations appear to have neglected the fact that there are two variants of UTF-16 with different endianness, and I don't know which they used. They do however provide their test files here for anyone who may wish to reproduce these experiments.

Regarding serving users a Windows-1252 encoded CSV: depending on your use case, I'm not convinced that is an ideal option. My main concern is that I strongly suspect (though haven't confirmed) that Excel only defaults to that encoding for users of English or Latin-script languages. So I suspect users with a different locale set will actually be opening it in a different codepage too. There may of course be some overlap between codes and characters in those codepages, but I've not verified that this is the case; if anyone wanted to investigate, a starting point might be here.

The TL;DR is that my general feeling on the matter is that providing Excel users a good CSV experience is a lost cause. I've had the misfortune to dig into it, probably deeper than most; however, it's a ridiculously complex area and I may well have missed something. Hopefully someone here finds this useful.
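One partial workaround sometimes suggested, which would need testing rather than taking on trust, is to write UTF-8 with a byte order mark, which recent versions of Excel use to detect the encoding; a minimal sketch (the file name is hypothetical):

```python
import csv

# "utf-8-sig" writes a UTF-8 byte order mark, which recent versions of
# Excel commonly use to detect the encoding. This is a workaround to
# verify, not a guarantee; "export.csv" is a hypothetical name.
with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Zoë", "München"])
```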

arnau commented 4 years ago

The Tabular data standard says:

RFC 4180 requires CRLF. You can also use “\n” (LF - Unix-style) for line breaks which is in line with W3C Best Practice.

I would welcome some clarification on this point in the guidance, or maybe an update to the data standard.

RFC 4180 allows any combination of CR and LF inside field values but mandates CRLF as a record separator.

This clarity makes the standard easier to test for in software and to identify potential errors in files.

A CSV that contains both LF and CRLF as record separators is more likely to indicate some sort of error in processing than be deliberate.

@gheye I think this touches on a key point: guidance for producers and consumers has to be different. For example, guidance on line breaks should be strict for producers and flexible for consumers.

gheye commented 4 years ago

Feedback from DWP via email:

gheye commented 4 years ago

Although in the google doc there is a link to gov guidance on using ISO formats for dates/times, I think that should be more explicit. Also, if you are doing multiple format guides it might be worth having a tier list of when to use which data format (i.e. we try to avoid CSV internally as much as possible).

Hi,

I don't think the CSV guidance should extend to date and time guidance. If you would like to change the guidance that is currently on GOV.UK around dates, please use this email, data-standards-authority@digital.cabinet-office.gov.uk, with perhaps an example of how you would like that guidance changed.

Thank you.

gheye commented 4 years ago

An anti-recommendation for using Excel with CSV. It is still extremely hard to persuade it to read and write UTF-8 properly; it will tend to default to Windows-1252. With the default open action you have no control over the cell formatting that is applied, and dates will be silently changed to Excel dates in whatever your default format is (though ironically Excel does not recognise ISO 8601 formatted dates as dates). This has been fairly widely publicised since geneticists realised it was happening to gene symbols, causing problems in analysis. They opted to change gene symbols to things that Excel wouldn't think are dates earlier this year: https://www.theregister.com/2020/08/06/excel_gene_names/. I've also come across it changing a surname, True, into Boolean TRUE. It makes these changes silently.

LibreOffice Calc on the other hand allows you to set the text encoding on opening, and control the formatting of each column.

Hi David,

Thanks for your comment. Perhaps we can point out some known issues with using Excel, and some things that should be checked. It is still a commonly used application in many departments.

jl5000 commented 4 years ago

There is much of value in this paper on Excel good practices, and the CSV guidance produced in this issue could augment what is in this paper:

https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

gheye commented 4 years ago

There is much of value in this paper on Excel good practices, and the CSV guidance produced in this issue could augment what is in this paper:

https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Hi,

This guidance is specific to CSV. We are developing guidance for spreadsheets, where referencing this paper may be more appropriate. I think, given the length of the article and that it only references CSV at the end, we cannot use it in this instance; however, there are certainly some good ideas in it which may make sense to develop.

gheye commented 4 years ago

I wonder if we should also recommend that column headers should use underscores instead of spaces, avoid 'special' characters like '%', and should not encode missing values as anything other than an empty field (I know some CSV writing functions write 'NA' by default).

A repetitive task when parsing CSV in Python is standardising header names; for me this involves...

  1. converting to "snake_case" (vs CamelCase) and
  2. dealing with special characters e.g. round brackets, ampersand etc.

For example "Orbital period (mins)" would become "orbital_period_mins" or perhaps "orbital_period"

Multiple strategies can be devised, but consistency is key.

@dcherrie-dstl Thank you for this. I have updated the guidance to encourage keeping column headers as simple as possible, and used your example.

jl5000 commented 4 years ago

There is much of value in this paper on Excel good practices, and the CSV guidance produced in this issue could augment what is in this paper: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Hi,

This guidance is specific to CSV. We are developing guidance for spreadsheets, where referencing this paper may be more appropriate. I think, given the length of the article and that it only references CSV at the end, we cannot use it in this instance; however, there are certainly some good ideas in it which may make sense to develop.

Personally, I think there is so much overlap I don't see too much of a distinction. The paper even recommends saving in text format, which almost completes the convergence.

gheye commented 4 years ago

The Tabular data standard says:

RFC 4180 requires CRLF. You can also use “\n” (LF - Unix-style) for line breaks which is in line with W3C Best Practice.

@bensagar-ea You're right, it's a bit ambiguous, and that's probably my fault. It's talking about line endings at the end of a record, rather than within a field.

@davidread @bensagar-ea we have requested an update to the tabular data standard for this.

gheye commented 4 years ago

@RickMoynihan @davidread @JKaranka @DavidUnderdown I have updated the guidance to point to Microsoft and the information provided on opening CSV files with Excel. I have also mentioned it should be tested. I think that is as far as we can go in the Excel conversation. If you think there should be further additions then please let me know.

gheye commented 4 years ago

@DavidUnderdown @ajtucker Thanks for your comments regarding tools. We are considering the right approach for adding this into the guidance. I will update you when we have progressed this.

gheye commented 4 years ago

Comments received from the Home Office:

Reply and changes:

I would repeat the comment from the tabular data page as guidance, specifically recommending the use of Get Data > From Text/CSV to import data into Excel to allow control over data types.

Even though the standard allows it, I would make it a recommendation that data does not contain line feeds or carriage return characters, as this is a frequent source of broken data pipelines. I have added the following comment into the text: RFC 4180 allows line feed and carriage return characters; however, ideally these should not exist in your data set. They are a frequent source of data import issues.

Rephrase “If they conform to recommended standards,” as “If CSV files conform to recommended standards,”.

gheye commented 4 years ago

Feedback from DWP via email:

  • I am perhaps missing the context, but it would be useful to know who the audience for this guidance is likely to be, what scenarios it would help them in, and what outcomes we are expecting this guidance to achieve.

I have updated the first two paragraphs and also included an introduction. Please see attachment.

  • Since there are also other scenarios and mechanisms to share data (e.g. Unstructured data items, Events, Real time) – it may be worth setting out scenarios where CSV might be appropriate and where it might not be. This might simply include some examples.

I have created a section for when it is not appropriate. CSV is so widely used that I would rather focus on that. If you would like to extend that further then please let me know.

  • I think mention or consideration of lineage might be a good idea to ensure the guidance covers this key aspect of data sharing. Of course we should be promoting the sharing of data consistently; however, we should also promote the tracking of lineage consistently.

This guidance is just for CSV. We have a team looking at metadata which is where the question of lineage should sit. Would you like me to put you in touch with them?

wangwDIT commented 4 years ago

Hi there,

Some comments from statistical colleagues in my Dept:

gheye commented 4 years ago

@wangwDIT I appreciate your comments. Could you email me at data-standards-authority@digital.cabinet-office.gov.uk, and I will arrange a call to go through your points.

Thanks for helping.

gheye commented 4 years ago

@wangwDIT, @davidread, @DavidUnderdown, @ajtucker, @RickMoynihan, @bensagar-ea, @jl5000, @dcherrie-dstl, @isichei

We also have a questionnaire open on the guidance we should write for spreadsheets. We would appreciate your input. It is available here.

michalc commented 4 years ago

I think I would probably want

you should not try and compress multiple database tables into one CSV file. Several CSV files could be used but is likely another format will more easily represent both the data and relationships

to be a bit more clear on what is/isn't encouraged.

In more detail:

In terms of when to split data into separate tables... I think there should be guidance on this. I would consider mentioning https://vita.had.co.nz/papers/tidy-data.pdf ? In terms of satisfying

have an interoperable dataset they can quickly use for data analysis

then tidy data, I argue, seems like a reasonable thing to encourage.
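As a rough sketch of the kind of reshaping tidy data implies (the table and column names are hypothetical), wide per-year columns become one observation per row:

```python
import pandas as pd

# A wide table with one column per year (names are hypothetical).
wide = pd.DataFrame({
    "region": ["North", "South"],
    "2019": [120, 95],
    "2020": [130, 90],
})

# Melt to tidy form: one row per region/year observation.
tidy = wide.melt(id_vars="region", var_name="year", value_name="count")
print(tidy)
```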

gheye commented 4 years ago

@michalc thanks for your comments. I had a call with your colleagues on Friday and we have updated the guidance accordingly. We have added items around file size. Following internal feedback I have also added some benefits of CSVs.

In terms of your feedback:

I am hoping the tech writers improve how we have written about database tables.

I think the structured lists point is for another piece of guidance and not for CSV.

Please note we are developing guidance on other file types, which I hope should address the comment about using another format.

Please let me know if you would like a call and we can go through this in more detail.

michalc commented 4 years ago

@gheye thanks for your response

I am hoping the tech writers improve how we have written about database tables

Ah cool

I think the structured lists point is for another piece of guidance and not for CSV.

Ah a shame... even since posting my comment I had another discussion with a colleague who wanted to put a list of items in a single value in a CSV. It comes up a lot in my world somehow!

Please note we are developing guidance on other file types, which I hope should address the comment about using another format.

Sounds good

Please let me know if you would like a call and we can go through this in more detail.

Actually... yes? That would be good. I'm on X-Gov Slack if that's a handy way to set things up?

michalc commented 4 years ago

it should not be used for the development of REST APIs

Also: on this, can this be clarified? As written... this does sort of suggest that if CSVs are provided, they should not be provided via a REST API?

(I ask because I'm involved in developing what could be described as a RESTful API, that supplies data in multiple formats, and one of those is CSV...)

arnau commented 4 years ago

@michalc although we don't directly provide any guidance on value formatting, we recommend CSVW in Using metadata to describe CSV data, which allows you to define multi-value fields and has defaults that might be of interest.

gheye commented 3 years ago

The version that is going to the steering board next week is published above.

I appreciate your time and input on this.

gheye commented 3 years ago

Hi all, @DavidUnderdown, @isichei, @JKaranka, @jl5000, @dcherrie-dstl, @bensagar-ea, @davidread, @ajtucker, @RickMoynihan, @wangwDIT, @michalc

Thank you all for your feedback and engaging with us over the last six months. Your feedback has been critical in shaping how we develop this guidance and improving it immeasurably.

The latest version is published above and we aim to have this up on GOV.UK in March.

We have two workshops planned on Thursday 28 January (3:30pm to 5:00pm) and Thursday 4 February (3:30pm to 5:00pm), where we will discuss the guidance and welcome feedback. If you would like to attend either of these sessions please email: data-standards-authority@digital.cabinet-office.gov.uk

If these times don't work then we are more than happy to have a call with you.

gheye commented 3 years ago

Hi all,

I would like to thank you all for your input. It has been invaluable in helping us develop our guidance. The guidance is now published on GOV.UK. We aim to iterate the guidance and are maintaining a backlog, so if you have any additions please let us know, either on Slack, GitHub or through our email: data-standards-authority@digital.cabinet-office.gov.uk

We also have proposed spreadsheet guidance you might be interested in.

gheye commented 3 years ago

Hi @DavidUnderdown, @isichei, @JKaranka, @jl5000, @dcherrie-dstl, @bensagar-ea, @davidread, @ajtucker, @RickMoynihan, @wangwDIT, @michalc

As we develop our spreadsheet guidance, we have drop-in sessions next week to gather feedback from the community. It would be great if you could attend.

Event 1 Event 2