ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).

Trouble with accents or other special characters - Export encoding format #87

Status: Open · pauline-chauvet opened this issue 1 year ago

pauline-chauvet commented 1 year ago

While exporting data containing special characters to .nccsv, the encoding makes the text hard to read for the user. For example, the original

"Observation temps réel des bouées Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"

becomes

"Observation temps r\u00e9el des bou\u00e9es Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"

Right now, the encoding is ASCII: ".nccsv - Download a NetCDF-3-like 7-bit ASCII NCCSV .csv file with COARDS/CF/ACDD metadata."
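
(For illustration, here is a minimal Java sketch of this kind of 7-bit escaping. The helper below is hypothetical, not ERDDAP's actual code, and real NCCSV also uses shorthand escapes such as \n for newline.)

```java
public class AsciiEscapeSketch {
    // Escape any character outside printable 7-bit ASCII as a backslash-u
    // hex sequence; roughly how "réel" becomes r\u00e9el in the output.
    static String toAsciiEscaped(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F)
                sb.append(c);                                 // printable ASCII: keep as-is
            else
                sb.append(String.format("\\u%04x", (int) c)); // escape everything else
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiEscaped("Observation temps réel"));
        // prints: Observation temps r\u00e9el
    }
}
```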

Will it be possible in the future to have it in UTF-8?

BobSimons commented 1 year ago

The underlying problem is that one of the goals of nccsv was to have it be readable by spreadsheet programs like Excel. Traditionally they just supported ASCII in csv files (I don't know about now). So if I changed the nccsv specification to support UTF-8 (which would obviously be good in many ways), it would break that basic goal.

I will say: although it isn't proper, a solution is to use the ISO 8859-1 character set in the nccsv files that you create. That gives you all of the common accented characters in European languages (with the exception of the Euro character and others above Unicode #255). ERDDAP already reads .nccsv files as if they have the ISO 8859-1 character set. (There is a programmer's saying "Be strict in what you write but liberal when you read.") Doing this may cause trouble down the road (e.g., if someone ever tries to load one of these files into Excel), but you may be willing to accept that risk.
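
As a rough sketch of that workaround (not ERDDAP code; the file name and metadata row here are made up, and Java 11+ is assumed), writing a source .nccsv file in ISO 8859-1 could look like:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteLatin1Nccsv {
    public static void main(String[] args) throws IOException {
        // Accented Latin-1 characters are written as single bytes,
        // not as backslash-u escapes.
        String content = "*GLOBAL*,title,\"Observation temps réel des bouées Viking\"\n";
        Files.writeString(Path.of("dataset.nccsv"), content, StandardCharsets.ISO_8859_1);
    }
}
```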

Is that a reasonable/acceptable solution for your purposes?

n-a-t-e commented 1 year ago

Hi Bob,

I don't follow the "non-proper" solution. Is there a way to download .nccsv files from an ERDDAP dataset with accented characters in the metadata and not have them end up as ASCII-escaped Unicode (e.g., "P\u00eaches" instead of "Pêches")?

ASCII-escaped Unicode isn't interpreted by Excel, so it seems to be less well supported than UTF-8, which can work in Excel if the file is loaded a certain way.

Thanks

Nate

BobSimons commented 1 year ago

Sorry, my bad. I skipped over the first line of the original question ("While exporting data written with special characters in .nccsv") and answered as if pauline-chauvet were making files to be imported into ERDDAP. Instead, you and pauline-chauvet are interested in the exported nccsv files. But the bulk of my answer remains: a goal of nccsv was to define a CSV standard that could be read into spreadsheets, and traditionally that has meant ASCII files. Yes, you're right, the escaping system is ugly compared to real UTF-8 characters, but it worked.

That said, I see that I (or my successor) should revisit this to see if all (or most) modern spreadsheet programs can now import UTF-8 CSV files. If so, then yes, I (or my successor) will change the nccsv specification to use UTF-8 and change ERDDAP to read and write UTF-8 CSV files.

Since I no longer work for NOAA and no longer have access to Excel, you (and others) can help. Can you please make a tiny UTF-8 CSV file with ASCII, 8859-1 (e.g., accented), and UTF-8-specific (e.g., the Euro symbol \u20ac) characters, then import the CSV file into various common spreadsheet programs (Excel, Google Docs, OpenOffice?, LibreOffice?, others?) to make sure each can read the file correctly? For each program, please post a message here with the results.

n-a-t-e commented 1 year ago

Thanks for taking a look. I tried opening this CSV file in various tools:

id,symbol,food,price
1,🐟,pêches,€4
2,🍕,pizza,¥1
3,🥦,légumes,₹2

BobSimons commented 1 year ago

That is great! Thank you very much for doing that.

Can someone else verify that it works in Windows Excel?

thogar-computer commented 1 year ago

Tested on Excel (Microsoft® Excel® for Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64-bit) for Windows 10, using the same file as above.

We can import directly from the web, and Excel auto-detects the UTF-8 encoding. [screenshot]

Or the data can be loaded via the From Text/CSV option from a blank Excel sheet, and the encoding is again auto-detected. [screenshot]

Both approaches display the data correctly. [screenshot]

When opening the CSV directly in Excel, it does not detect the encoding correctly, and therefore the data is shown incorrectly.

pauline-chauvet commented 1 year ago

Also tested on Microsoft® Excel® for Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64-bit, with the same result as thogar-computer. [screenshot]

Also tested with Google Sheets online (opened directly from my Google Drive): it works using the "Import" tool! [screenshot]

pauline-chauvet commented 1 year ago

And it also works with LibreOffice Calc Version 7.4.3.2 (x64) / LibreOffice Community (CPU threads: 8; OS: Windows 10.0 Build 19044). [screenshot]

BobSimons commented 1 year ago

Excellent! Thank you all very much for doing the tests.

Okay. I will make the changes to the NCCSV specification and to the way ERDDAP reads and writes NCCSV files. Hopefully, it will be in the next release.

pauline-chauvet commented 1 year ago

This is great! Thank you very much =)

kutoso commented 9 months ago

I was wondering if the encoding could be changed from UTF-8 to UTF-8 with BOM? I believe this adjustment might resolve the issue in Excel. As previously mentioned, when opening the CSV file directly in Excel, the encoding detection is not accurate, resulting in incorrectly displayed data. Currently, the Excel import tool is required to view the data correctly. With UTF-8-BOM encoding, the import tool might no longer be needed, as the data should be displayed accurately without any additional steps.

BobSimons commented 9 months ago

For the uninitiated, BOM = Byte Order Mark.
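
Concretely, the UTF-8 BOM is the three-byte sequence EF BB BF at the very start of the file. A minimal sketch of writing a CSV with a BOM (the file name and data are hypothetical, not ERDDAP's code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteUtf8BomCsv {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Path.of("data.csv"))) {
            out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}); // UTF-8 BOM
            out.write("id,food\n1,pêches\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Excel treats those leading bytes as a signature that the file is UTF-8, which is why a plain double-click can then decode it correctly.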

I didn't know there was a problem reading the UTF-8 nccsv files into Excel. Do other people have this problem?

I didn't have Excel when I made the changes above and so relied on the tests from those users. (I still don't have Excel.)

I'm disappointed in Excel. It should be able to detect and read UTF-8 without a BOM. See the top answer at https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom, which quotes the Unicode spec, section 2.6: "Use of a BOM is neither required nor recommended for UTF-8". See also the mentions of "byte order" on the UTF-8 Wikipedia page, https://en.wikipedia.org/wiki/UTF-8, which agree but note some software exceptions. Note also that the JSON spec says a BOM must not be used.

So I think adding the BOM is a bad idea in general. It seems like trouble to add it just to fix the Excel use case (and probably cause problems elsewhere), but that is a primary use case.

Comments? Does anyone else have a solution for automatically importing the nccsv files correctly in Excel?

robitaillej commented 9 months ago

Reading UTF-8 into Excel isn't necessarily a problem. However, as previously mentioned in this conversation, users must use the import wizard to load a CSV without a BOM correctly. I understand that using a BOM can cause issues with certain software applications. Nevertheless, one of the primary objectives of the nccsv format is to ensure compatibility with spreadsheet programs like Excel and Google Sheets. Therefore, wouldn't specifying the BOM make sense if it facilitates the workflow for our primary users (those who rely on Excel or Google Sheets)?

BobSimons commented 9 months ago

Although I don't like BOMs and don't like going against the UTF-8 file recommendations, your point about spreadsheets being a primary objective is strong. (MS/Excel is non-standard trouble yet again.)

Can you (or someone) please check: if a UTF-8 nccsv file has a BOM, can it be read correctly (hopefully automatically) in all of the common spreadsheet programs listed above?

n-a-t-e commented 9 months ago

I converted my test file above to UTF-8 with BOM: https://gist.github.com/n-a-t-e/005339cdb23905cc4118db57f41cfb72

I tested it in:

Google sheets

LibreOffice Community 7.2.5.2 (Mac)

Apple Numbers 10.3.5 (Mac)

Excel 16.77.1 (Mac)

kutoso commented 9 months ago

Hi, in my case I created a CSV UTF-8-BOM file and I could open it with all of these spreadsheet programs:

Excel (Windows, version 2309) [screenshot]

Notepad++ v8.4.7 (64-bit) [screenshot]

Sublime Text [screenshot]

Notepad (Bloc-notes), Windows Version 22H2, build 19045.3570 [screenshot]

OpenOffice 4.1.14 (an import wizard pops up, and then it is OK) [screenshot]

WordPad [screenshot]

Google Sheets [screenshot]

BobSimons commented 9 months ago

@n-a-t-e, that is good news for the non-Excel imports, but it is a bummer that Excel doesn't read the Unicode+BOM file automatically. That was the main goal.

@kutoso (or @n-a-t-e), why do you think your test of unicode+BOM didn't trigger the Import Wizard?

n-a-t-e commented 9 months ago

I think I should clarify my previous long message: the BOM file works great in Excel on Mac and Windows (tested both).

There is a tiny quirk, at least in the Mac version -

BobSimons commented 9 months ago

Thanks for clarifying. So a BOM sounds like a good idea.

@ChrisPJohn, it is up to you, but it sounds to me like a good idea to:

Thank you @kutoso, @robitaillej, and @n-a-t-e. This is a good improvement.

ChrisJohnNOAA commented 9 months ago

That sounds like a solid plan. I'll add this to the todo list.