CLIMADA-project / climada_python

Python (3.8+) version of CLIMADA
GNU General Public License v3.0

Correct Natural Earth ISO codes #30

Closed tovogt closed 4 years ago

tovogt commented 4 years ago

As noted by @sameberenz (see https://github.com/CLIMADA-project/climada_python/commit/1d61fe24780d8f56d98b01fdc0ec46e28ba2b7e3), Natural Earth has some Admin 0 regions without numeric identifiers. Among them is Norway, which is easy to solve since it actually has an ISO code. But there are others that are more complicated to deal with like Kosovo (Kosovo is not in ISO 3166).

Any ideas on what to do about these regions? We could define our own codes above 900, since ISO 3166 leaves that range unassigned.

These are the Natural Earth NAME attributes of the affected regions, together with the numeric codes that I have used so far:

{
    "Dhekelia": 826,  # UK
    "Somaliland": 706,  # Somalia
    "Norway": 578,  # Norway
    "Kosovo": 983,  # used in iso3166 package
    "USNB Guantanamo Bay": 840,  # USA
    "N. Cyprus": 196,  # Cyprus
    "Cyprus U.N. Buffer Zone": 196,  # Cyprus
    "Siachen Glacier": 356,  # India
    "Baikonur": 398,  # Kazakhstan
    "Akrotiri": 826,  # UK
    "Indian Ocean Ter.": 826,  # UK
    "Coral Sea Is.": 36,  # Australia
    "Spratly Is.": 912,  # ?
    "Clipperton I.": 250,  # France
    "Bajo Nuevo Bank": 170,  # Colombia
    "Serranilla Bank": 170,  # Colombia
    "Scarborough Reef": 156,  # PR China
}
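
For illustration, a table like the one above could serve as a fallback whenever the data carries no usable numeric code. This is only a sketch: the names `FALLBACK_CODES`, `MISSING` and `numeric_code` are made up here, the -99 placeholder for a missing code is an assumption, and only a few entries of the table are repeated.

```python
# Hypothetical fallback table: a subset of the name-to-code mapping listed above.
FALLBACK_CODES = {
    "Norway": 578,
    "Kosovo": 983,
    "N. Cyprus": 196,
    "Dhekelia": 826,
}

MISSING = -99  # assumption: placeholder used when no numeric code is available


def numeric_code(name, iso_n3):
    """Return a numeric region code, falling back to the manual table.

    `name` is the Natural Earth NAME attribute; `iso_n3` is the numeric
    code as read from the data (int, str or None).
    """
    try:
        code = int(iso_n3)
    except (TypeError, ValueError):
        code = MISSING
    if code > 0:
        return code
    # No usable code in the data: use the manual assignment, if there is one.
    return FALLBACK_CODES.get(name, MISSING)
```

Regions that appear in neither place keep the placeholder, so they stay easy to spot.
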
sameberenz commented 4 years ago

Hi Thomas,

At least for Kosovo, commit https://github.com/CLIMADA-project/climada_python/commit/a7851a872ccb8695bddc505537d5af0246317255 solves the issue for get_country_code(), which gets the region_id from coordinates:

from iso3166 import countries as iso_cntry
from climada.util.coordinates import get_country_code, get_country_geometries

# Look up the numeric region ID from coordinates (latitude, longitude of Pristina)
get_country_code([42.666667], [21.166667])
# >> array([983])

# Resolve the remaining codes and names from the numeric code
iso_cntry.get(983)
# >> Country(name='Kosovo', alpha2='XK', alpha3='XKX', numeric='983', apolitical_name='Kosovo')

Retrieving the geometry based on the country name or code does not work for Kosovo yet (e.g. get_country_geometries(['XKX']) returns an empty dataframe). But this should be possible to fix based on the number 983, using iso_cntry.get() to set the other codes and names.

For Norway, the commit solves the issue in both directions.

best, Sam

mmyrte commented 4 years ago

A general remark on this topic: I encountered the same issues in 2018 when working on the WISC part. I did report some missing and false data to the Natural Earth project, but haven't ever gotten a response. The issues listed here indicate that it's moving rather slowly, and pull requests don't get accepted too quickly.

Don't get me wrong, I think the use of Natural Earth is sensible regarding polygon resolution etc., but maybe we could (in the long term) look around for other projects. Or we could fork the Natural Earth repo and correct such issues ourselves, or contribute to one of the more active forks.

sameberenz commented 4 years ago

> A general remark on this topic: I encountered the same issues in 2018 when working on the WISC part. I did report some missing and false data to the Natural Earth project, but haven't ever gotten a response. The issues listed here indicate that it's moving rather slowly, and pull requests don't get accepted too quickly.
>
> Don't get me wrong, I think the use of Natural Earth is sensible regarding polygon resolution etc., but maybe we could (in the long term) look around for other projects. Or we could fork the Natural Earth repo and correct such issues ourselves, or contribute to one of the more active forks.

Thanks for your insights! It would be good to use an alternative. As for ISO codes, the iso3166 package has never failed me. That's why I use it to fill gaps in nat_earth in https://github.com/CLIMADA-project/climada_python/commit/a7851a872ccb8695bddc505537d5af0246317255.

BTW: I also noticed that there is another function in https://github.com/CLIMADA-project/climada_python/blob/develop/climada/util/coordinates.py which does almost the same as get_country_geometries(): get_land_geometry(). However, the method and output are slightly different. Maybe they could be combined at some point.

tovogt commented 4 years ago

Thanks for your replies!

The ISO_A3 field is fine in the Natural Earth GeoDataFrame. That's why there is no problem in choosing a specific country. There are no gaps in that sense.

I wasn't clear enough in my original post: I wanted to point to the problem of numeric representations of regions, which is not a problem of the underlying Natural Earth data. There is no official way to assign numbers to the regions mentioned in my first post (they are not in ISO 3166!). I just wanted to suggest assigning some (arbitrary?) numbers in the context of CLIMADA. For some regions, we could reuse the codes of existing countries; for others, we could define new numeric codes. But it's not an urgent issue, of course.

BTW: Yes, get_land_geometry should merely apply shapely.ops.cascaded_union to the output of get_country_geometries. The input parameters should be handled consistently.

sameberenz commented 4 years ago

> There is no official way to assign numbers to the regions mentioned in my first post (they are not in ISO 3166!). I just wanted to suggest assigning some (arbitrary?) numbers in the context of CLIMADA. For some regions, we could just assign existing countries, for others we could define new numerical codes. But it's not an urgent issue, of course.

Hi Thomas, thanks for your reply! At least for Kosovo, there is an entry in the iso3166 package, as pointed out above. I don't know how official the codes it provides are, though (name='Kosovo', alpha2='XK', alpha3='XKX', numeric='983').

For the other regions listed, I think either assigning individual numbers or using the numbers above could serve us well, depending on the application. Of course, if you do a risk assessment for the government of North Cyprus, you should not assign it to Cyprus.

mmyrte commented 4 years ago

> I also noticed, that there is another function in https://github.com/CLIMADA-project/climada_python/blob/develop/climada/util/coordinates.py which does almost the same as get_country_geometries(): get_land_geometry(). However, the method and output is slightly different. Maybe they could be combined at some point.

Yeah, I think I touched parts of those functions. I didn't want to change others' code, so maybe that's why I ended up providing separate methods.

Regarding the regions that aren't in ISO 3166: I don't think we should make up non-standard numeric IDs. We could instead rename the country_names argument to something unambiguous, like isoa3_names, and then provide other selectors, or just accept a dict to filter the geopandas DataFrame.
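
To make the dict idea a bit more concrete, here is a rough sketch with plain Python records standing in for the rows of the GeoDataFrame. The function name `select_regions` and the sample rows are made up for illustration; nothing here is existing CLIMADA API.

```python
def select_regions(rows, selectors):
    """Keep the rows whose attributes match every selector.

    `selectors` maps an attribute name to one allowed value or to a
    list/tuple/set of allowed values, e.g. {"ISO_A3": ["CHE", "NOR"]}.
    """
    def matches(row):
        for key, allowed in selectors.items():
            if not isinstance(allowed, (list, tuple, set)):
                allowed = [allowed]
            if row.get(key) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Made-up sample rows; real rows would come from the Natural Earth table.
rows = [
    {"NAME": "Switzerland", "ISO_A3": "CHE"},
    {"NAME": "Norway", "ISO_A3": "NOR"},
    {"NAME": "Kosovo"},
]

by_code = select_regions(rows, {"ISO_A3": ["CHE", "NOR"]})
by_name = select_regions(rows, {"NAME": "Kosovo"})
```

With selectors keyed by attribute name, a region without an ISO code can still be picked by its NAME, without inventing an identifier for it.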

tovogt commented 4 years ago

Yes, all of these regions are somehow disputed and therefore it might make sense to assign individual codes in the 900 range.

Regarding Kosovo: The iso3166 package assigns 983 to Kosovo even though this is not in accordance with the ISO 3166 specification. They took the number from the statistical office of Canada: https://github.com/deactivated/python-iso3166/issues/18

Regarding my choice of assignment: For the military bases (like Dhekelia), I assigned the code of the country whose military it is. For other regions, I tried to choose geographically close countries that have an ISO code (e.g. Cyprus). It's definitely not ideal.

The numeric IDs are not for choosing a country, but for representing countries in numeric raster maps (e.g. with the region_id attribute of Centroids).

mmyrte commented 4 years ago

Ah, you're right, I forgot that. Sorry, been away for a while. In that case the arbitrary number does make sense. The reliance on region_id always seemed a tad constraining to me, but the only viable alternative would be to simply generate a hash.
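
Just to sketch what the hash alternative could look like: the function name `hashed_region_id` and the 31-bit width are assumptions chosen for illustration. Python's built-in hash() is salted per process for strings, so a hashlib digest is used to keep the IDs reproducible.

```python
import hashlib


def hashed_region_id(name, n_bits=31):
    """Derive a stable non-negative integer ID from a region name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Take the first 8 bytes and reduce them to n_bits, so that the ID
    # fits into a signed 32-bit integer when n_bits is 31.
    return int.from_bytes(digest[:8], "big") % (1 << n_bits)
```

Such IDs are reproducible across runs, but they are not guaranteed to be collision-free and, unlike ISO codes, they carry no meaning for a human reader.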

tovogt commented 4 years ago

The function most affected by this issue is get_country_code, and for this function I have now introduced a mapping of disputed areas to values above 900 (in alphabetical order). I think this solves the main part of the issue.
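
Roughly, a mapping of this kind could be built as follows. This sketch is not the code that went into CLIMADA: the function name `assign_codes`, the start value 901, the upper bound 999 and the selection of names are assumptions for illustration.

```python
def assign_codes(names, first_code=901):
    """Map region names, in alphabetical order, to consecutive codes."""
    ordered = sorted(set(names))
    # Assumption: codes should stay within three digits, i.e. at most 999.
    if first_code + len(ordered) - 1 > 999:
        raise ValueError("too many regions for the range up to 999")
    return {name: first_code + i for i, name in enumerate(ordered)}


# A few of the names from the first post, in arbitrary order.
codes = assign_codes(["Spratly Is.", "Kosovo", "N. Cyprus", "Somaliland"])
```

Since the codes depend on the sort order, adding a region later shifts the codes of everything after it, so a fixed table is safer once the values are stored in data.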