CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Country/Region should be written as ISO or CLDR country-code #470

Open designbyadrian opened 4 years ago

designbyadrian commented 4 years ago

In your latest files, you have updated the names of a few countries, including:

Iran (Islamic Republic of), Republic of Korea, Hong Kong SAR, Taipei and environs, Viet Nam, occupied Palestinian territory, Macao SAR, Russian Federation, Republic of Moldova, Saint Martin, Channel Islands, Holy See

This leads to a mismatch between latest and previous data.

I suggest you use only ISO codes to identify countries, and let whoever consumes your data translate the codes to country names.
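A minimal sketch of the mismatch and of what keying by code would fix. The alias table, the older spellings, and the numbers are made up for illustration; only the two-letter ISO 3166-1 codes are real.

```python
# Hypothetical alias table: every spelling seen in the data maps to one
# ISO 3166-1 alpha-2 code. The older spellings are assumed for illustration.
NAME_TO_CODE = {
    "Iran": "IR",
    "Iran (Islamic Republic of)": "IR",
    "South Korea": "KR",
    "Republic of Korea": "KR",
    "Vietnam": "VN",
    "Viet Nam": "VN",
    "Russia": "RU",
    "Russian Federation": "RU",
}


def to_code(name: str) -> str:
    """Return the country code for a free-text name, or raise if unknown."""
    try:
        return NAME_TO_CODE[name.strip()]
    except KeyError:
        raise KeyError(f"no code known for country name: {name!r}") from None


# Two daily snapshots keyed by free-text name (made-up numbers).
previous_day = {"Iran": 100, "South Korea": 200}
latest_day = {"Iran (Islamic Republic of)": 120, "Republic of Korea": 210}

# Joining on names finds nothing in common; joining on codes matches every row.
shared_names = set(previous_day) & set(latest_day)
prev_by_code = {to_code(n): v for n, v in previous_day.items()}
latest_by_code = {to_code(n): v for n, v in latest_day.items()}
daily_change = {c: latest_by_code[c] - prev_by_code[c] for c in prev_by_code}

print(shared_names)  # set()
print(daily_change)  # {'IR': 20, 'KR': 10}
```

If the files carried the code column directly, consumers would not need an alias table like this at all.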

Eclipsed830 commented 4 years ago

While I understand you are referring to the two-letter country code, it should be pointed out that one issue with ISO country codes is that they are politicized in nature, since ISO 3166 takes its country names and entries from United Nations sources. This is why many software developers use Unicode CLDR instead: http://cldr.unicode.org/translation/displaynames/country-names

https://github.com/unicode-cldr/cldr-localenames-full/blob/master/main/en/territories.json

designbyadrian commented 4 years ago

Ohh! I didn't know this!

Well, I'm open to any solution that doesn't require comparing free text to free text.

Eclipsed830 commented 4 years ago

> Ohh! I didn't know this!
>
> Well, I'm open to any solution that doesn't require comparing free text to free text.

It's a complicated world out there! :P For example, if we use the country name from ISO 3166, Taiwan would be "Taiwan (Province of China)", which would essentially put us in the same sinking boat we are in today, as Taiwan isn't and has never been a "Province of China". Furthermore, if you ship a device labeled like this, there is a chance you won't be able to sell it in the Taiwanese market. On the flip side, if you label Taiwan as Taiwan, you might not be able to sell in the Chinese market either. lol

designbyadrian commented 4 years ago

For sure! But I suppose that CLDR has a different number of entries from ISO 3166 (I haven't checked yet).

My opinion is that a quick solution, even if you use ISO 3166, is to use only the codes, ignoring the names, and keep a translation of your own for whichever region you're in.
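Roughly what I have in mind on the consumer side, as a sketch. The `DISPLAY_NAMES` tables and the `display_name` helper are hypothetical; each consumer would maintain their own names per locale.

```python
# Hypothetical per-locale name tables, maintained by the data consumer
# rather than by the dataset. Only the two-letter codes come from the data.
DISPLAY_NAMES = {
    "en": {"KR": "South Korea", "VN": "Vietnam", "RU": "Russia"},
    "de": {"KR": "Südkorea", "VN": "Vietnam", "RU": "Russland"},
}


def display_name(code: str, locale: str = "en") -> str:
    """Look up a display name; fall back to English, then to the raw code."""
    for table in (DISPLAY_NAMES.get(locale, {}), DISPLAY_NAMES["en"]):
        if code in table:
            return table[code]
    return code


print(display_name("KR", "de"))  # Südkorea
print(display_name("KR", "fr"))  # South Korea (no "fr" table, English fallback)
print(display_name("XX"))        # XX (unknown code is shown as-is)
```

The naming question then lives entirely in the consumer's table, and the dataset itself stays stable.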

3verse commented 4 years ago

Thank you for raising the issue, @designbyadrian. My two pence: any changes in names should be propagated across the whole dataset, not just one day of data. Otherwise, for everyone out there (including me) using this data to track daily developments, inconsistent country names will only lead to poorly usable data and to daily workarounds just to make sense of it.

I believe there's no political statement in asking to have consistent data.
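As a sketch of the kind of check I mean, the following flags any name that is missing from at least one day's file. The `inconsistent_names` helper, the day labels, and the sample names are made up for illustration.

```python
from __future__ import annotations


def inconsistent_names(snapshots: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each name missing from at least one day to the days it does appear in."""
    all_days = sorted(snapshots)
    report: dict[str, list[str]] = {}
    for name in sorted(set().union(*snapshots.values())):
        days = [day for day in all_days if name in snapshots[day]]
        if len(days) < len(all_days):
            report[name] = days
    return report


# Made-up example: one country is renamed between two daily files.
snapshots = {
    "day-1": {"Italy", "Russia"},
    "day-2": {"Italy", "Russian Federation"},
}
report = inconsistent_names(snapshots)
print(report)  # {'Russia': ['day-1'], 'Russian Federation': ['day-2']}
```

A country that genuinely appears for the first time would be flagged too, so the report is a list of candidates to review, not proof of a rename.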

Bost commented 4 years ago

> This is why most software developers instead use Unicode CLDR. http://cldr.unicode.org/translation/displaynames/country-names
>
> https://github.com/unicode-cldr/cldr-localenames-full/blob/master/main/en/territories.json

Uhm, that's new to me, so before I give you a :+1: I need to take a proper look at it. Anyway, thank you.

BTW this discussion is a dupe of https://github.com/CSSEGISandData/COVID-19/issues/372

designbyadrian commented 4 years ago

@Bost Dang, and I thought of posting this several days ago 😆

Closing this in favour of #372 and #482