ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).

Trouble with accents or other special characters - Export encoding format #87

Status: Open · pauline-chauvet opened this issue 1 year ago

pauline-chauvet commented 1 year ago

While exporting data containing special characters to .nccsv, the encoding makes the text hard to read for the user. For example, the original

"Observation temps réel des bouées Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"

becomes

"Observation temps r\u00e9el des bou\u00e9es Viking du PMZA (provisoire)\n\tAZMP Viking Buoy Observations (Provisional)"

Right now, the encoding is ASCII: ".nccsv - Download a NetCDF-3-like 7-bit ASCII NCCSV .csv file with COARDS/CF/ACDD metadata."
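
(For illustration, here is a minimal Java sketch of this kind of 7-bit escaping. The helper below is hypothetical, not ERDDAP's actual code, and real NCCSV also uses shorthand escapes such as \n for newline.)

```java
public class AsciiEscapeSketch {
    // Escape any character outside printable 7-bit ASCII as a backslash-u
    // hex sequence; roughly how "réel" becomes r\u00e9el in the output.
    static String toAsciiEscaped(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F)
                sb.append(c);                                 // printable ASCII: keep as-is
            else
                sb.append(String.format("\\u%04x", (int) c)); // escape everything else
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiEscaped("Observation temps réel"));
        // prints: Observation temps r\u00e9el
    }
}
```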

Will it be possible in the future to have it in UTF-8?

BobSimons commented 1 year ago

The underlying problem is that one of the goals of nccsv was to have it be readable by spreadsheet programs like Excel. Traditionally they just supported ASCII in csv files (I don't know about now). So if I changed the nccsv specification to support UTF-8 (which would obviously be good in many ways), it would break that basic goal.

I will say: although it isn't proper, a solution is to use the ISO 8859-1 character set in the nccsv files that you create. That gives you all of the common accented characters in European languages (with the exception of the Euro character and others above Unicode #255). ERDDAP already reads .nccsv files as if they have the ISO 8859-1 character set. (There is a programmer's saying "Be strict in what you write but liberal when you read.") Doing this may cause trouble down the road (e.g., if someone ever tries to load one of these files into Excel), but you may be willing to accept that risk.
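
As a rough sketch of that workaround (not ERDDAP code; the file name and metadata row here are made up, and Java 11+ is assumed), writing a source .nccsv file in ISO 8859-1 could look like:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteLatin1Nccsv {
    public static void main(String[] args) throws IOException {
        // Accented Latin-1 characters are written as single bytes,
        // not as backslash-u escapes.
        String content = "*GLOBAL*,title,\"Observation temps réel des bouées Viking\"\n";
        Files.writeString(Path.of("dataset.nccsv"), content, StandardCharsets.ISO_8859_1);
    }
}
```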

Is that a reasonable/acceptable solution for your purposes?

n-a-t-e commented 1 year ago

Hi Bob,

I don't follow the "non-proper" solution. Is there a way to download .nccsv files from an ERDDAP dataset with accented characters in the metadata and not have them end up as ASCII-escaped Unicode (e.g., "P\u00eaches" instead of "Pêches")?

ASCII-escaped Unicode isn't interpreted by Excel, so it seems to be less well supported than UTF-8, which can work in Excel if the file is loaded a certain way.

Thanks

Nate

BobSimons commented 1 year ago

Sorry, my bad. I skipped over the first line of the original question ("While exporting data written with special characters in .nccsv") and answered as if pauline-chauvet were making files to be imported into ERDDAP. Instead, you and pauline-chauvet are interested in the exported nccsv files. But the bulk of my answer remains: a goal of nccsv was to define a CSV standard that could be read into spreadsheets, and traditionally that has meant ASCII files. Yes, you're right, the escaping system is ugly compared to real UTF-8 characters, but it worked.

That said, I see that I (or my successor) should revisit this to see if all (or most) modern spreadsheet programs can now import UTF-8 CSV files. If so, then yes, I (or my successor) will change the nccsv specification to use UTF-8 and change ERDDAP to read and write UTF-8 CSV files.

Since I no longer work for NOAA and no longer have access to Excel, you (and others) can help. Can you please make a tiny UTF-8 CSV file with ASCII, 8859-1 (e.g., accented), and UTF-8-specific (e.g., the Euro symbol \u20ac) characters, then import the CSV file into various common spreadsheet programs (Excel, Google Docs, OpenOffice?, LibreOffice?, others?) to make sure each can read the file correctly? For each program, please post a message here with the results.

n-a-t-e commented 1 year ago

Thanks for taking a look. I tried opening this CSV file in various tools:

id,symbol,food,price
1,🐟,pêches,€4
2,🍕,pizza,¥1
3,🥦,légumes,₹2

BobSimons commented 1 year ago

That is great! Thank you very much for doing that.

Can someone else verify that it works in Windows Excel?

thogar-computer commented 1 year ago

Tested on Excel (Microsoft® Excel® for Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64-bit) for Windows 10, using the same file as above.

We can import directly from the web, and Excel auto-detects the UTF-8 encoding. [screenshot]

Or the data can be loaded via the From Text/CSV option from a blank Excel sheet, and the encoding is again auto-detected. [screenshot]

Both approaches display the data correctly. [screenshot]

When opening the CSV directly in Excel, it does not detect the encoding correctly, and therefore the data is shown incorrectly.

pauline-chauvet commented 1 year ago

Also tested on Microsoft® Excel® for Microsoft 365 MSO (Version 2211 Build 16.0.15831.20098) 64-bit, with the same result as thogar-computer. [screenshot]

Also tested with Google Sheets online (opened directly from my Google Drive): it works using the "Import" tool! [screenshot]

pauline-chauvet commented 1 year ago

And it also works with LibreOffice Calc Version 7.4.3.2 (x64) / LibreOffice Community (CPU threads: 8; OS: Windows 10.0 Build 19044). [screenshot]

BobSimons commented 1 year ago

Excellent! Thank you all very much for doing the tests.

Okay. I will make the changes to the NCCSV specification and to the way ERDDAP reads and writes NCCSV files. Hopefully, it will be in the next release.

pauline-chauvet commented 1 year ago

This is great! Thank you very much =)

kutoso commented 9 months ago

I was wondering if the encoding could be changed from UTF-8 to UTF-8 with BOM? I believe this adjustment might resolve the issue in Excel. As previously mentioned, when opening the CSV file directly in Excel, the encoding detection is not accurate, resulting in incorrectly displayed data. Currently, the Excel import tool is required to view the data correctly. With UTF-8-BOM encoding, the import tool might no longer be needed, as the data should be displayed accurately without any additional steps.

BobSimons commented 9 months ago

For the uninitiated, BOM = Byte Order Mark.
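
Concretely, the UTF-8 BOM is the three-byte sequence EF BB BF at the very start of the file. A minimal sketch of writing a CSV with a BOM (the file name and data are hypothetical, not ERDDAP's code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteUtf8BomCsv {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Path.of("data.csv"))) {
            out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}); // UTF-8 BOM
            out.write("id,food\n1,pêches\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Excel treats those leading bytes as a signature that the file is UTF-8, which is why a plain double-click can then decode it correctly.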

I didn't know there was a problem reading the UTF-8 nccsv files into Excel. Do other people have this problem?

I didn't have Excel when I made the changes above and so relied on the tests from those users. (I still don't have Excel.)

I'm disappointed in Excel. It should be able to detect and read UTF-8 without a BOM. See the top answer at https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom, which quotes the Unicode spec, section 2.6: "Use of a BOM is neither required nor recommended for UTF-8". See also the mentions of "byte order" on the UTF-8 Wikipedia page, https://en.wikipedia.org/wiki/UTF-8, which agree but note some software exceptions. Note also that the JSON spec says a BOM must not be used.

So I think adding the BOM is a bad idea in general. It seems like trouble to add it just to fix the Excel use case (and probably cause problems elsewhere), but that is a primary use case.

Comments? Does anyone else have a solution for automatically importing the nccsv files correctly in Excel?

robitaillej commented 9 months ago

Reading UTF-8 into Excel isn't necessarily a problem. However, as previously mentioned in this conversation, users must use the import wizard to load a CSV without a BOM correctly. I understand that using a BOM can cause issues with certain software applications. Nevertheless, one of the primary objectives of the nccsv format is to ensure compatibility with spreadsheet programs like Excel and Google Sheets. Therefore, wouldn't specifying the BOM make sense if it facilitates the workflow for our primary users (those who rely on Excel or Google Sheets)?

BobSimons commented 9 months ago

Although I don't like BOMs and don't like going against the UTF-8 file recommendations, your point about spreadsheets being a primary objective is strong. (MS/Excel is non-standard trouble yet again.)

Can you (or someone) please check: if a UTF-8 nccsv file has a BOM, can it be read correctly (hopefully automatically) in all of the common spreadsheet programs listed above?

n-a-t-e commented 9 months ago

I converted my test file above to UTF-8 with BOM: https://gist.github.com/n-a-t-e/005339cdb23905cc4118db57f41cfb72

I tested it in:

Google sheets

LibreOffice Community 7.2.5.2 (Mac)

Apple Numbers 10.3.5 (Mac)

Excel 16.77.1 (Mac)

kutoso commented 9 months ago

Hi, in my case I created a CSV UTF-8-BOM file and I could open it with all of these spreadsheet programs:

Excel (Windows, version 2309) [screenshot]

Notepad++ v8.4.7 (64-bit) [screenshot]

Sublime Text [screenshot]

Notepad (Bloc-notes), Windows Version 22H2, build 19045.3570 [screenshot]

OpenOffice 4.1.14 (an import wizard pops up, and then it is OK) [screenshot]

WordPad [screenshot]

Google Sheets [screenshot]

BobSimons commented 9 months ago

@n-a-t-e, that is good news for the non-Excel imports, but it is a bummer that Excel doesn't read the Unicode+BOM file automatically. That was the main goal.

@kutoso (or @n-a-t-e), why do you think your test of unicode+BOM didn't trigger the Import Wizard?

n-a-t-e commented 9 months ago

I think I should clarify my previous long message: the BOM file works great in Excel on Mac and Windows (tested both).

There is a tiny quirk, at least in the Mac version -

BobSimons commented 9 months ago

Thanks for clarifying. So a BOM sounds like a good idea.

@ChrisPJohn, it is up to you, but it sounds to me like a good idea to:

Thank you @kutoso, @robitaillej, and @n-a-t-e. This is a good improvement.

ChrisJohnNOAA commented 9 months ago

That sounds like a solid plan. I'll add this to the todo list.