DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Excel doesn't decode multi-byte UTF-8 characters in TSV manifests #3129

Open hannes-ucsc opened 3 years ago

hannes-ucsc commented 3 years ago

Navigate to this project and click the download icon in the Metadata column. This will download a tsv file.

On Windows 10: If I open the file automatically with Excel by double-clicking the file from Windows Explorer then the utf-8 encoding of "micrometer" in the "cell_suspension.cell_morphology.cell_size_unit.text" column displays incorrectly as "μm". If I start Excel, Open File, select the "All Files" type I get the "Text Import Wizard", set "File Origin" to "65001: Unicode (utf-8)", "Finish", then the text in the "cell_suspension.cell_morphology.cell_size_unit.text" column displays correctly as "μm". So on Windows, having the file extension ".tsv" set to "always use Excel to open this type of file" will cause the bug.

hannes-ucsc commented 3 years ago

Would be helpful if ~@theathorn~ someone with Excel on macOS could try to reproduce this given the problem description on the "Re: Exchange format consultation" email thread.

theathorn commented 3 years ago

Windows Excel screenshot attached. Screenshot 2021-06-10 102803.png

theathorn commented 3 years ago

Copy cell of interest to notepad: image.png

Hex dump of that cell (via notepad): Screenshot 2021-06-10 104225.png

theathorn commented 3 years ago

Also see https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically

hannes-ucsc commented 3 years ago

Specifically, this answer seems to indicate that adding a UTF-8 BOM fixes the issue at least on more recent versions of Excel.

hannes-ucsc commented 3 years ago

I think we should ask someone on our team to install Office 360 — which @achave11 pointed out was available for UCSC employees — and try to reproduce the issue on Mac.

hannes-ucsc commented 3 years ago

… and also the suspected fix.

theathorn commented 3 years ago

On my work PC the file extension ".tsv" is configured to auto-open with Notepad (so I don't encounter this bug). On my home PC the file extension ".tsv" is configured to auto-open with Excel (so I do encounter this bug).

melainalegaspi commented 3 years ago

@achave11 spike to reproduce on Mac OS and confirm suspected fix.

achave11-ucsc commented 3 years ago

The bug is reproducible in Mac v10.15.6 using Microsoft Excel for Mac v16.50. Excell-v16 49

The suspected fix (adding UTF-8 BOM) corrects the displayed format. Screen Shot 2021-06-16 at 11 10 34 AM But it's done at the expense of the column organization. When the BOM is introduce all the columns in the TSV get stuffed into the first column.

melainalegaspi commented 3 years ago

@hannes-ucsc : "@achave11 to provide hexdump of first two lines in TSV after BOM header insertion."

achave11-ucsc commented 3 years ago

@hannes-ucsc: "The findings are Excel doesn't support UTF-8 encoded TSV, only UTF-8 encoded CSV. The longterm solution is to offer an alternative manifest format for the latter, but that requires more elaborate handling of commas and quotes. In the short term we need to advice our users to import the TSV via File - Import rather than double clicking the file, followed by selecting UTF-8 encoding in the import dialog."