datasets / un-locode

United Nations Codes for Trade and Transport Locations (UN/LOCODE) and Country Codes
https://datahub.io/core/un-locode

NameWoDiacritics doesn't work for some diacritics #26

Closed cristan closed 5 months ago

cristan commented 6 months ago

For some diacritics, such as the letter ü, NameWoDiacritics works just fine:

,NA,LUD,Lüderitz,Luderitz,,AI,1--4----,0212,,2639S 01510E,

For other, more exotic diacritics like ă, this doesn't work.

,MD,VUL,Vulcănesti,Vulcănesti,GA,RL,--3-----,2301,,4541N 02824E,

This kinda makes sense: in the last Secretariat notes, you'll see that these characters are replaced with a: à, á, â, ã, ä, å, æ. Notably, ă is absent from this list.

I've run a script to test them all, and these are the special characters still present in NameWoDiacritics: ň, č, ť, ř, ě, ň, č, ť, Č, ć, ů, ő, ē, ā, ī, Ġ, Ġ, ł, ę, ţ, ľ, ď.

Disclaimer: my script didn't actually extract the specific special characters, so I might have missed a few. Let me know if you are actually interested in this. To be honest, I don't really mind, considering I can read the file just fine with this project, but maybe you'd want to know.
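A check along these lines could be sketched in Python as follows. This is not the script mentioned above, only a minimal illustration. The function name `leftover_non_ascii` and the column index 4 are assumptions, with the index taken from the position of NameWoDiacritics in the two sample rows quoted in this issue.

```python
import csv
import io


def leftover_non_ascii(csv_text, column=4):
    """Return the sorted non-ASCII characters found in one CSV column.

    column=4 is an assumption based on the sample rows in this issue,
    where NameWoDiacritics is the fifth field.
    """
    found = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > column:
            # Any code point above 127 is outside plain ASCII.
            found.update(ch for ch in row[column] if ord(ch) > 127)
    return sorted(found)


# The two rows quoted above; \u00fc is "ü" and \u0103 is "ă".
sample = (
    ",NA,LUD,L\u00fcderitz,Luderitz,,AI,1--4----,0212,,2639S 01510E,\n"
    ",MD,VUL,Vulc\u0103nesti,Vulc\u0103nesti,GA,RL,--3-----,2301,,4541N 02824E,\n"
)

print(leftover_non_ascii(sample))
```

Unlike the script described above, this version collects the individual characters, so it reports each offending letter once.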

sabas commented 6 months ago

Yeah, kindly note this also for when we start reviewing the new system requirements... This is probably a bug in the output script. We can fix it in this dataset by passing that column through some kind of filter, or do you want to keep it as is to allow for data quality checks?
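For illustration, such a filter could look roughly like the Python sketch below. It is an assumption about one possible approach (Unicode NFD decomposition plus a small manual table), not the output script used for the official release. The names `strip_diacritics` and `NO_DECOMPOSITION` are hypothetical, and the table is deliberately incomplete.

```python
import unicodedata

# Letters such as "ł" have no canonical decomposition, so NFD leaves them
# untouched and they need an explicit mapping. This table is illustrative
# only; other letters of this kind would need their own entries.
NO_DECOMPOSITION = str.maketrans({"\u0142": "l", "\u0141": "L"})  # ł, Ł


def strip_diacritics(name):
    """Return name with combining marks removed after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", name.translate(NO_DECOMPOSITION))
    # unicodedata.combining() is non-zero for combining marks such as
    # the breve in "ă" or the cedilla in "ţ".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Vulc\u0103nesti"))  # \u0103 is "ă"
```

A filter like this would make the column actually free of diacritics in this dataset, at the cost of hiding the upstream problem from data quality checks.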

cristan commented 6 months ago

There are arguments for both. Adding an extra filter would make the column actually free of diacritics. On the other hand, keeping it as is would make it easier to test whether this issue is actually solved in the new system.

Since you can argue either way, I would do nothing and close this story until you get a comment / issue from somebody who actually uses the entries in NameWoDiacritics.

sabas commented 5 months ago

Needs update in the official release process.