Open lognaturel opened 1 year ago
Do we have a sense of how many ODK users use Excel? Probably many, right? I think the dream would be to provide an option to download submission data as .xlsx such that there is a single workbook with multiple sheets rather than multiple CSV files, each of which needs to be individually imported. However, I'm not sure what options there are in Node to stream to an .xlsx file. Assuming that we're not going to pursue that route, I think it's an interesting question whether to add the BOM to CSV files.
However, it's likely that other downstream tools would then have trouble opening our CSVs.
I agree that there would likely be some issue. I think it's likely that there are downstream tools that only work with ASCII data and would have trouble with the BOM, yet are still useful to a subset of users. (I would mention odkmeta
, which only works with ASCII data, but I'm not sure how many users still use it.) That said, I would guess that most downstream tools would be able to process the BOM. If that's the case, and if many users use Excel and would find the BOM convenient, maybe that's reason to include the BOM.
However, since we know that Excel can work without the BOM — it's just less convenient — my instinct is that we should leave things as-is. If we were to make a change, I think we could do so as an option to include the BOM or not. Maybe the option even defaults to including the BOM. But we probably shouldn't just always include the BOM given that we think that downstream tools would break.
tools that only work with ASCII data and would have trouble with the BOM,
Any custom scripts that read those files would need to handle the BOM explicitly or use a library that does. I've experienced not handling the BOM and it's pretty surprising/weird. Maybe it's easy enough for a dev to look this up and find out more. I don't have a good sense of other systems that would or would not handle it. Do you have an idea of what odkmeta
would do, for example?
would guess that most downstream tools would be able to process the BOM
Citation? 😊
we probably shouldn't just always include the BOM given that we think that downstream tools would break.
Yes, that sounds right. 😢
Do you have an idea of what
odkmeta
would do, for example?
Usually the type
column is the first column of the survey
sheet. odkmeta
will look for the type
column, but won't find it in this case, because it thinks the BOM is part of the column header.
would guess that most downstream tools would be able to process the BOM
Citation? 😊
Haha emphasis on "guess"! But for example, Stata for a long time didn't support Unicode at all (that's why odkmeta
has issues). But these days it's happy to import a CSV file with Unicode. I have to think that most data analysis tools that are in widespread use are able to import Unicode somehow. Though maybe there's software that's able to import Unicode, but would still get tripped up on the BOM? Anyway, all guesses!
I'm still mulling over the idea of an optional BOM. Maybe we could include the BOM by default in downloads from Frontend, but give users a way to opt out if it's breaking? I'm not sure I've ever seen such an option in other software though. 🤔
I just lost a couple of minutes of my life to UTF-8 BOMs again. I saved a CSV from Excel and it did have the BOM. Then I tried to parse it in a Python script and the first column name wasn't found. I added encoding='utf-8-sig'
to the open
and that worked.
It's not clear to me what someone without prior knowledge of BOMs would have done. Maybe printed the contents of the parsed file, noticed \ufeff
and ended up at this Stackoverflow post? It's hard for me to gauge how much of a roadblock it would be.
I think the dream would be to provide an option to download submission data as .xlsx such that there is a single workbook with multiple sheets rather than multiple CSV files
I concur. Excel is a standardized format that solves a lot of the ambiguity and pitfalls that exist around working with CSVs. And that's when using implementations that account for some of the diversity in CSV framing and quoting dialects (and BOMs), we're not even talking about the problems with all the ad-hoc "CSV parsers" that keep springing into existence, as in "oh just split by \n
and then split by ,
right?".
Wherever I've introduced Excel downloads instead of CSV downloads, support issues just melted away.
However, I'm not sure what options there are in Node to stream to an .xlsx file
Hmmm, I think it'd be worth it to investigate the requirements and their hardness and the potential solutions.
The questions in this issue are still live for us, so I'm going to go ahead and move it to the central
repo. We've been slowly moving all user-facing issues to that repo.
I'm opening this issue so we have a record that this was considered but I'm currently leaning towards doing nothing.
CSV exports currently do not include UTF-8 byte-order-marks. In general, this is a good practice because "the Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use." (https://en.wikipedia.org/wiki/Byte_order_mark)
Unfortunately, modern versions of Excel continue to open CSVs that don't have BOMs as ASCII. That regularly trips up users:
If we add the BOM, exported CSVs would open as expected in Excel when double-clicked. However, it's likely that other downstream tools would then have trouble opening our CSVs. The advantage of the current state of things is that there is one known bad behavior. With a BOM, there are likely to be various different kinds of problems that manifest differently.