dr5hn / countries-states-cities-database

🌍 Discover our global repository of countries, states, and cities! 🏙️ Get comprehensive data in JSON, SQL, PSQL, XML, YAML, and CSV formats. Access ISO2, ISO3 codes, country code, capital, native language, timezones (for countries), and more. #countries #states #cities
https://dr5hn.github.io/countries-states-cities-database/
Open Data Commons Open Database License v1.0
7.3k stars 2.54k forks source link

Duplicate entries in cities table #626

Open scottgigante opened 1 year ago

scottgigante commented 1 year ago

There are a number of duplicate entries in cities with identical name, state_code and country_code.

            id         name  state_id state_code        state_name  country_id country_code country_name   latitude   longitude wikiDataId
19430   149214     Zhaoguli      2276         TJ           Tianjin          45           CN        China  39.166730  117.241910  Q13667923
19431   149236     Zhaoguli      2276         TJ           Tianjin          45           CN        China  39.166730  117.241910  Q13667923
45551    57683         Amod      4030         GJ           Gujarat         101           IN        India  21.993170   72.870470   Q2724262
45552   147574         Amod      4030         GJ           Gujarat         101           IN        India  21.991000   72.871000   Q2724262
45740   133106       Nadiad      4030         GJ           Gujarat         101           IN        India  22.693850   72.861570    Q219039
45741   147636       Nadiad      4030         GJ           Gujarat         101           IN        India  22.700000   72.870000    Q219039
46809    57912         Beed      4008         MH       Maharashtra         101           IN        India  18.989210   75.756340    Q814033
46810    58026         Beed      4008         MH       Maharashtra         101           IN        India  18.833330   75.750000    Q814037
46821   147709        Bhoom      4008         MH       Maharashtra         101           IN        India  18.459722   75.660833    Q796520
46822    58024        Bhoom      4008         MH       Maharashtra         101           IN        India  18.459080   75.658770    Q796520
77843   149883     De Weere      2612         NH     North Holland         156           NL  Netherlands  52.774444    4.866944   Q2456668
77844   149884     De Weere      2612         NH     North Holland         156           NL  Netherlands  52.722800    4.998600   Q1810945
77855   149890   Driehuizen      2612         NH     North Holland         156           NL  Netherlands  52.578900    4.808900   Q2594292
77856   149891   Driehuizen      2612         NH     North Holland         156           NL  Netherlands  53.046667    4.776111   Q2115818
80038    79099         Dale      1018         46          Vestland         165           NO       Norway  61.363530    5.400360   Q1783017
80039    79098         Dale      1018         46          Vestland         165           NO       Norway  60.586390    5.818880   Q1951189
80191    79389          Nes      1011         30             Viken         165           NO       Norway  60.122520   11.465700    Q488026
80192    79390          Nes      1011         30             Viken         165           NO       Norway  60.568040    9.101690    Q483986
109947   33614    Chamartín      1189         AV             Ávila         207           ES        Spain  40.702670   -4.957690    Q732971
109948   33615    Chamartín      1189         AV             Ávila         207           ES        Spain  40.462060   -3.676600    Q732971
115384   36744    Salamanca      1147         SA         Salamanca         207           ES        Spain  40.968820   -5.663880    Q577585
115385   36745    Salamanca      1147         SA         Salamanca         207           ES        Spain  40.429720   -3.679750   Q1773521
117299   34207      Fonfría      1161         ZA            Zamora         207           ES        Spain  40.995660   -1.084750    Q574337
117300   34208      Fonfría      1161         ZA            Zamora         207           ES        Spain  41.634930   -6.140810   Q1653149
117452   38044  Villaescusa      1161         ZA            Zamora         207           ES        Spain  41.206170   -5.464060   Q1922230
117453   38045  Villaescusa      1161         ZA            Zamora         207           ES        Spain  42.962620   -4.167060   Q1922230
124695  149321    Baranivka      5071         53  Poltavska oblast         230           UA      Ukraine  50.300000   27.666667    Q148917
124696  149322    Baranivka      5071         53  Poltavska oblast         230           UA      Ukraine  49.948092   34.008276    Q891724

If we normalize unicode to latin-1, we catch even more of these.

            id                      name  state_id state_code                     state_name  country_id country_code country_name   latitude   longitude wikiDataId
8573      8484               Cox's Bazar       803          B            Chittagong Division          19           BD   Bangladesh  21.583890   92.015280   Q1122278
8574      8485               Cox’s Bāzār       803          B            Chittagong Division          19           BD   Bangladesh  21.439730   92.009550    Q949746
19430   149214                  Zhaoguli      2276         TJ                        Tianjin          45           CN        China  39.166730  117.241910  Q13667923
19431   149236                  Zhaoguli      2276         TJ                        Tianjin          45           CN        China  39.166730  117.241910  Q13667923
44906   135200              Mosfellsbaer      3431          1                 Capital Region         100           IS      Iceland  64.150000  -21.650000    Q829859
44907   135201               Mosfellsbær      3431          1                 Capital Region         100           IS      Iceland  64.166670  -21.700000    Q829859
45551    57683                      Amod      4030         GJ                        Gujarat         101           IN        India  21.993170   72.870470   Q2724262
45552   147574                      Amod      4030         GJ                        Gujarat         101           IN        India  21.991000   72.871000   Q2724262
45740   133106                    Nadiad      4030         GJ                        Gujarat         101           IN        India  22.693850   72.861570    Q219039
45741   147636                    Nadiad      4030         GJ                        Gujarat         101           IN        India  22.700000   72.870000    Q219039
46809    57912                      Beed      4008         MH                    Maharashtra         101           IN        India  18.989210   75.756340    Q814033
46810    58026                      Beed      4008         MH                    Maharashtra         101           IN        India  18.833330   75.750000    Q814037
46821   147709                     Bhoom      4008         MH                    Maharashtra         101           IN        India  18.459722   75.660833    Q796520
46822    58024                     Bhoom      4008         MH                    Maharashtra         101           IN        India  18.459080   75.658770    Q796520
66656    67889            Petite Rivière      3259         BL                    Black River         140           MU    Mauritius -20.195510   57.445920  Q24045809
66657   153386            Petite Riviere      3259         BL                    Black River         140           MU    Mauritius -20.193643   57.447747  Q27762631
66664    67837           Brisée Verdière      3254         FL                          Flacq         140           MU    Mauritius -20.163890   57.646670  Q25406451
66665   153435           Brisee Verdiere      3254         FL                          Flacq         140           MU    Mauritius -20.168076   57.652236        NaN
66671    67850                 Clémencia      3254         FL                          Flacq         140           MU    Mauritius -20.263890   57.706110  Q27763620
66672   153439                 Clemencia      3254         FL                          Flacq         140           MU    Mauritius -20.264716   57.713386        NaN
66696   153390                 Mahebourg      3264         GP                     Grand Port         140           MU    Mauritius -20.413962   57.705583        NaN
66697    67874                 Mahébourg      3264         GP                     Grand Port         140           MU    Mauritius -20.408060   57.700000   Q1291379
66712    67855                 Dagotière      3253         MO                           Moka         140           MU    Mauritius -20.244760   57.561880  Q27763707
66713   153446                 Dagotiere      3253         MO                           Moka         140           MU    Mauritius -20.246189   57.559881        NaN
66732    67853               Crève Coeur      3250         PA                  Pamplemousses         140           MU    Mauritius -20.191110   57.557220  Q27763632
66733   153412               Creve Coeur      3250         PA                  Pamplemousses         140           MU    Mauritius -20.187804   57.565305        NaN
66738    67880  Morcellement Saint André      3250         PA                  Pamplemousses         140           MU    Mauritius -20.071990   57.568330    Q282908
66739   153414  Morcellement Saint Andre      3250         PA                  Pamplemousses         140           MU    Mauritius -20.073637   57.566847        NaN
66778    67859       Espérance Trébuchet      3261         RR             Rivière du Rempart         140           MU    Mauritius -20.069720   57.641940  Q27763650
66779   153430       Esperance Trebuchet      3261         RR             Rivière du Rempart         140           MU    Mauritius -20.072479   57.648050        NaN
77843   149883                  De Weere      2612         NH                  North Holland         156           NL  Netherlands  52.774444    4.866944   Q2456668
77844   149884                  De Weere      2612         NH                  North Holland         156           NL  Netherlands  52.722800    4.998600   Q1810945
77855   149890                Driehuizen      2612         NH                  North Holland         156           NL  Netherlands  52.578900    4.808900   Q2594292
77856   149891                Driehuizen      2612         NH                  North Holland         156           NL  Netherlands  53.046667    4.776111   Q2115818
80038    79099                      Dale      1018         46                       Vestland         165           NO       Norway  61.363530    5.400360   Q1783017
80039    79098                      Dale      1018         46                       Vestland         165           NO       Norway  60.586390    5.818880   Q1951189
80191    79389                       Nes      1011         30                          Viken         165           NO       Norway  60.122520   11.465700    Q488026
80192    79390                       Nes      1011         30                          Viken         165           NO       Norway  60.568040    9.101690    Q483986
91060    85861             Bielsko-Biala      1623         SL           Silesian Voivodeship         176           PL       Poland  49.822450   19.046860    Q106583
91061    85862             Bielsko-Biała      1623         SL           Silesian Voivodeship         176           PL       Poland  49.812040   19.038990    Q106583
109947   33614                 Chamartín      1189         AV                          Ávila         207           ES        Spain  40.702670   -4.957690    Q732971
109948   33615                 Chamartín      1189         AV                          Ávila         207           ES        Spain  40.462060   -3.676600    Q732971
115384   36744                 Salamanca      1147         SA                      Salamanca         207           ES        Spain  40.968820   -5.663880    Q577585
115385   36745                 Salamanca      1147         SA                      Salamanca         207           ES        Spain  40.429720   -3.679750   Q1773521
117299   34207                   Fonfría      1161         ZA                         Zamora         207           ES        Spain  40.995660   -1.084750    Q574337
117300   34208                   Fonfría      1161         ZA                         Zamora         207           ES        Spain  41.634930   -6.140810   Q1653149
117452   38044               Villaescusa      1161         ZA                         Zamora         207           ES        Spain  41.206170   -5.464060   Q1922230
117453   38045               Villaescusa      1161         ZA                         Zamora         207           ES        Spain  42.962620   -4.167060   Q1922230
123682  110182             Oktyabr'skoye      4689         43  Autonomous Republic of Crimea         230           UA      Ukraine  45.362150   36.368030   Q1786696
123683  110183             Oktyabr’skoye      4689         43  Autonomous Republic of Crimea         230           UA      Ukraine  45.288660   34.135210   Q1786696
124695  149321                 Baranivka      5071         53               Poltavska oblast         230           UA      Ukraine  50.300000   27.666667    Q148917
124696  149322                 Baranivka      5071         53               Poltavska oblast         230           UA      Ukraine  49.948092   34.008276    Q891724
150200  130957                    Sa'dah      1249         SD                          Saada         245           YE        Yemen  16.940210   43.763930    Q991395
150201  131010                    Şa‘dah      1249         SD                          Saada         245           YE        Yemen  16.917330   43.760000   Q4117548

Repro:

import pandas as pd
import unidecode

BASE_URL = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/csv"

cities_df = pd.read_csv(BASE_URL + "/cities.csv", na_values='', keep_default_na=False)
cities_df = cities_df.loc[~cities_df['state_code'].isna()]
cities_df['city_id'] = cities_df['country_code'] + "-" + cities_df['state_code'] + "-" + cities_df['name'].apply(unidecode.unidecode)
cities_df.loc[cities_df['city_id'].duplicated(False)].drop("city_id", axis=1)