CartoDB / data-services

CARTO internal geocoder PostgreSQL extension
25 stars 11 forks source link

picks iso codes for admin0_synonyms #236

Open AleGoiko opened 6 years ago

AleGoiko commented 6 years ago

tries to fix https://github.com/CartoDB/data-services/issues/233

Any feedback is welcome, as I am not sure this is the best approach.

AleGoiko commented 6 years ago

do you mean more countries that change or format wise? if its the latter the format looked the same to me, if its the former, not that I saw but I may be missing something.

ethervoid commented 6 years ago

I was thinking if moving from use adm0_a3 to iso_a3 as the principal data for countries changes anything more apart from South Sudan (more discrepancies between iso and the american format)

AleGoiko commented 6 years ago

other than running test.sh is there any other way to test this?

ethervoid commented 6 years ago

I'd compare the two tables to be completely sure

AleGoiko commented 6 years ago

so i tweaked build_synonym_table.sql a little bit more and the result is as follows (being admin0_synonyms_orig the original admin0_synonyms, and admin0_synonyms the new table and both tables containing only the inserts from ne_admin0_v3 ):

dataservices_db=> SELECT adm0_a3, name, rank FROM admin0_synonyms_orig EXCEPT SELECT adm0_a3, name, rank from admin0_synonyms;
 adm0_a3 |               name               | rank 
---------+----------------------------------+------
 SDS     | SDS                              |    3
 CCK     | CCK                              |    3
 BVT     | BVT                              |    3
 BES     | BES                              |    3
 PSX     | West Bank and Gaza               |    5
 ALD     | ALD                              |    3
 SDS     | Republic of South Sudan          |    5
 SAH     | Western Sahara                   |    9
 REU     | REU                              |    3
 SAH     | W. Sah.                          |    8
 SDS     | S. Sud.                          |    8
 GUF     | GUF                              |    3
 PSX     | PSX                              |    3
 GLP     | GLP                              |    3
 MTQ     | MTQ                              |    3
 FRH     | FRH                              |    3
 TKL     | TKL                              |    3
 SAH     | ESH                              |    0
 SDS     | SSD                              |    0
 SAH     | SAH                              |    3
 SJM     | SJM                              |    3
 PSX     | PSE                              |    0
 MYT     | MYT                              |    3
 ALD     | Åland Islands                    |    5
 CXR     | CXR                              |    3
 ALD     | ALA                              |    0
 SAH     | Sahrawi Arab Democratic Republic |    5
 SDS     | South Sudan                      |    9

the entries that are missing seem outdated or redundant anyway.

I was having a look at other files and I am not sure if we need to change them since the old codes are still around, only that with a lower ranking

ethervoid commented 6 years ago

Ok, looks good. Check why the tests are failing and we can move on

AleGoiko commented 6 years ago

restest this please

AleGoiko commented 6 years ago

@ethervoid looks like it was a glitch :sweat_smile:

ethervoid commented 6 years ago

Great that fixes the underlying problem, we need to create a patch for the current data, as you can see here in order to avoid recreate all the adm0 data. What do you think?

AleGoiko commented 6 years ago

ok so, here's where I stand right now with this:

I created the following tables:

the difference between admin0_synonyms and admin0_synonyms_iso is as follows(remove the .zip extension): SELECT adm0_a3, name, rank FROM admin0_synonyms EXCEPT SELECT adm0_a3, name, rank from admin0_synonyms_iso;

It is mostly outdated or repetitive data except for lines such as:

ESP,Sp.,8

which seem to be missing from the new admin0_synonyms_iso table.

However, upon closer inspection of the build_synonym_table.sql this explains why it is missing:

INSERT INTO admin0_synonyms (name, rank,  adm0_a3) 
SELECT 
    abbrev, 8, adm0_a3
FROM
    ne_admin0_v3
WHERE
    char_length(regexp_replace(abbrev, '[^a-zA-Z\u00C0-\u00ff]+', '', 'g')) > 3;

for some reason the admin0_synonyms in the db_dump files does not seem to incluide this query, however the adm0_a3_local does include it as it follows build_synonym_table.sql. In this case the difference (between the iso table and the regular one created locally) the diff is as follows: SELECT adm0_a3, name, rank FROM admin0_synonyms_local EXCEPT SELECT adm0_a3, name, rank from admin0_synonyms_iso;

 adm0_a3 |               name                | rank 
---------+-----------------------------------+------
 CCK     | CCK                               |    3
 CCK     | Cocos (Keeling) Islands           |    0
 BVT     | BVT                               |    3
 REU     | Reunion                           |    0
 MTQ     | Martinique                        |    0
 SAH     | Western Sahara                    |    9
 SDS     | S. Sudan                          |    0
 SDS     | S. Sud.                           |    8
 GUF     | GUF                               |    3
 SJM     | Svalbard and Jan Mayen            |    0
 MTQ     | MTQ                               |    3
 FRH     | FRH                               |    3
 ALD     | Aland                             |    0
 SAH     | ESH                               |    0
 FRH     | Corse                             |    0
 CXR     | CXR                               |    3
 SAH     | Sahrawi Arab Democratic Republic  |    5
 GLP     | Guadeloupe                        |    0
 SDS     | South Sudan                       |    9
 SDS     | SDS                               |    3
 BES     | BES                               |    3
 PSX     | West Bank and Gaza                |    5
 ALD     | ALD                               |    3
 SDS     | Republic of South Sudan           |    5
 REU     | REU                               |    3
 SAH     | W. Sah.                           |    8
 BES     | Bonaire (Sint Eustatius and Saba) |    0
 TKL     | Tokelau                           |    0
 PSX     | PSX                               |    3
 GLP     | GLP                               |    3
 GUF     | French Guiane                     |    0
 MYT     | Mayotte                           |    0
 CXR     | Christmas Island                  |    0
 BVT     | Bouvet Island                     |    0
 TKL     | TKL                               |    3
 SDS     | SSD                               |    0
 SAH     | SAH                               |    3
 SJM     | SJM                               |    3
 PSX     | PSE                               |    0
 MYT     | MYT                               |    3
 ALD     | Åland Islands                     |    5
 ALD     | ALA                               |    0
 PSX     | Palestine                         |    0
 SAH     | W. Sahara                         |    0

and it makes much more sense.

I am very lost at this point, not sure if the sql generated in admin0_synonyms_iso is a suitable candidate to replace the current admin0_synonyms in the dumps :(

ethervoid commented 6 years ago

What does Rank mean? what is the main use of the rank column?