Consider adding UTF-8 BOM to CSV exports

getodk / central

ODK Central is a server that is easy to use, very fast, and stuffed with features that make data collection easier. Contribute and make the world a better place! ✨🗄✨

https://docs.getodk.org/central-intro/

Apache License 2.0

125 stars 155 forks source link

Consider adding UTF-8 BOM to CSV exports #752

Open lognaturel opened 1 year ago

lognaturel commented 1 year ago

I'm opening this issue so we have a record that this was considered but I'm currently leaning towards doing nothing.

CSV exports currently do not include UTF-8 byte-order-marks. In general, this is a good practice because "the Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use." (https://en.wikipedia.org/wiki/Byte_order_mark)

Unfortunately, modern versions of Excel continue to open CSVs that don't have BOMs as ASCII. That regularly trips up users:

They only have a few UTF-8 characters so don't notice and end up with bad chars in their analysis
They notice bad chars but are stuck on what to do next
They somehow find the tip at the bottom of this docs section and are annoyed by that process

If we add the BOM, exported CSVs would open as expected in Excel when double-clicked. However, it's likely that other downstream tools would then have trouble opening our CSVs. The advantage of the current state of things is that there is one known bad behavior. With a BOM, there are likely to be various different kinds of problems that manifest differently.

matthew-white commented 1 year ago

Do we have a sense of how many ODK users use Excel? Probably many, right? I think the dream would be to provide an option to download submission data as .xlsx such that there is a single workbook with multiple sheets rather than multiple CSV files, each of which needs to be individually imported. However, I'm not sure what options there are in Node to stream to an .xlsx file. Assuming that we're not going to pursue that route, I think it's an interesting question whether to add the BOM to CSV files.

However, it's likely that other downstream tools would then have trouble opening our CSVs.

I agree that there would likely be some issue. I think it's likely that there are downstream tools that only work with ASCII data and would have trouble with the BOM, yet are still useful to a subset of users. (I would mention odkmeta, which only works with ASCII data, but I'm not sure how many users still use it.) That said, I would guess that most downstream tools would be able to process the BOM. If that's the case, and if many users use Excel and would find the BOM convenient, maybe that's reason to include the BOM.

However, since we know that Excel can work without the BOM — it's just less convenient — my instinct is that we should leave things as-is. If we were to make a change, I think we could do so as an option to include the BOM or not. Maybe the option even defaults to including the BOM. But we probably shouldn't just always include the BOM given that we think that downstream tools would break.

lognaturel commented 1 year ago

tools that only work with ASCII data and would have trouble with the BOM,

Any custom scripts that read those files would need to handle the BOM explicitly or use a library that does. I've experienced not handling the BOM and it's pretty surprising/weird. Maybe it's easy enough for a dev to look this up and find out more. I don't have a good sense of other systems that would or would not handle it. Do you have an idea of what odkmeta would do, for example?

would guess that most downstream tools would be able to process the BOM

Citation? 😊

we probably shouldn't just always include the BOM given that we think that downstream tools would break.

Yes, that sounds right. 😢

matthew-white commented 1 year ago

Do you have an idea of what odkmeta would do, for example?

Usually the type column is the first column of the survey sheet. odkmeta will look for the type column, but won't find it in this case, because it thinks the BOM is part of the column header.

would guess that most downstream tools would be able to process the BOM

Citation? 😊

Haha emphasis on "guess"! But for example, Stata for a long time didn't support Unicode at all (that's why odkmeta has issues). But these days it's happy to import a CSV file with Unicode. I have to think that most data analysis tools that are in widespread use are able to import Unicode somehow. Though maybe there's software that's able to import Unicode, but would still get tripped up on the BOM? Anyway, all guesses!

I'm still mulling over the idea of an optional BOM. Maybe we could include the BOM by default in downloads from Frontend, but give users a way to opt out if it's breaking? I'm not sure I've ever seen such an option in other software though. 🤔

lognaturel commented 1 year ago

I just lost a couple of minutes of my life to UTF-8 BOMs again. I saved a CSV from Excel and it did have the BOM. Then I tried to parse it in a Python script and the first column name wasn't found. I added encoding='utf-8-sig' to the open and that worked.

It's not clear to me what someone without prior knowledge of BOMs would have done. Maybe printed the contents of the parsed file, noticed \ufeff and ended up at this Stackoverflow post? It's hard for me to gauge how much of a roadblock it would be.

brontolosone commented 1 month ago

I think the dream would be to provide an option to download submission data as .xlsx such that there is a single workbook with multiple sheets rather than multiple CSV files

I concur. Excel is a standardized format that solves a lot of the ambiguity and pitfalls that exist around working with CSVs. And that's when using implementations that account for some of the diversity in CSV framing and quoting dialects (and BOMs), we're not even talking about the problems with all the ad-hoc "CSV parsers" that keep springing into existence, as in "oh just split by \n and then split by , right?". Wherever I've introduced Excel downloads instead of CSV downloads, support issues just melted away.

However, I'm not sure what options there are in Node to stream to an .xlsx file

Hmmm, I think it'd be worth it to investigate the requirements and their hardness and the potential solutions.

matthew-white commented 4 weeks ago

The questions in this issue are still live for us, so I'm going to go ahead and move it to the central repo. We've been slowly moving all user-facing issues to that repo.