datasets / un-locode

United Nations Codes for Trade and Transport Locations (UN/LOCODE) and Country Codes
https://datahub.io/core/un-locode

Aliases are duplicated #28

Open cristan opened 1 month ago

cristan commented 1 month ago

Check out https://github.com/datasets/un-locode/blob/main/data/alias.csv

Let's take the first line:

GL,Christianshaab = Qasigiannguit (Christianshaab),Christianshaab = Qasigiannguit (Christianshaab)

That row appears twice (the second copy is at line 88). The same is true of every line I've checked.
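A quick way to confirm the duplication is to count repeated rows with pandas. This is only a sketch: the sample rows below are made up (apart from the Christianshaab line quoted above), and the header names are assumed to match the columns written by the export script.

```python
import io

import pandas as pd

# Stand-in for data/alias.csv. The "XX" row is invented for illustration;
# the header names are an assumption based on the export script's columns.
sample_csv = io.StringIO(
    "Country,Name,NameWoDiacritics\n"
    "GL,Christianshaab = Qasigiannguit (Christianshaab),Christianshaab = Qasigiannguit (Christianshaab)\n"
    "XX,Oldtown = Newtown (Oldtown),Oldtown = Newtown (Oldtown)\n"
    "GL,Christianshaab = Qasigiannguit (Christianshaab),Christianshaab = Qasigiannguit (Christianshaab)\n"
)

alias_df = pd.read_csv(sample_csv)

# duplicated() flags every repeat after the first occurrence of a row.
duplicate_count = int(alias_df.duplicated().sum())

# keep=False shows all copies of a duplicated row, including the first one.
all_copies = alias_df[alias_df.duplicated(keep=False)]

print(f"{duplicate_count} duplicated row(s)")
print(all_copies)
```

To run it against the real file, replace `sample_csv` with the path `data/alias.csv`.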

sabas commented 1 month ago

ChatGPT suggests simply dropping the duplicates :D. I'll look at this after the other PRs are discussed (@gradedSystem)

# Collect alias rows in a list
alias_list = []

for index, row in unlocode_df.iterrows():
    if pd.isna(row['Location']) or row['Location'] == '':
        if row['Change'] == '=': # alias row
            alias_list.append(row[['Country', 'Name', 'NameWoDiacritics']])

# Create alias_df from the list
alias_df = pd.DataFrame(alias_list, columns=['Country', 'Name', 'NameWoDiacritics'])
alias_df.drop_duplicates(inplace=True)

# Save the alias DataFrame to CSV
alias_df.to_csv("data/alias.csv", index=False)
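
The same selection can probably be done without `iterrows()` by using a boolean mask. The sketch below assumes `unlocode_df` has the `Change`, `Country`, `Location`, `Name` and `NameWoDiacritics` columns used above; the small frame is made-up data standing in for the real one.

```python
import pandas as pd

# Made-up stand-in for unlocode_df; "XYZ" and the "XX" rows are hypothetical.
alias_text = "Christianshaab = Qasigiannguit (Christianshaab)"
unlocode_df = pd.DataFrame(
    {
        "Change": ["=", "=", None, "="],
        "Country": ["GL", "GL", "XX", "XX"],
        "Location": [None, None, "XYZ", ""],
        "Name": [alias_text, alias_text, "Newtown", "Oldtown = Newtown"],
        "NameWoDiacritics": [alias_text, alias_text, "Newtown", "Oldtown = Newtown"],
    }
)

# Alias rows have no location code (missing or empty) and "=" in Change.
no_location = unlocode_df["Location"].isna() | (unlocode_df["Location"] == "")
is_alias = no_location & (unlocode_df["Change"] == "=")

# Select the alias columns and drop exact repeats in one step.
alias_df = unlocode_df.loc[
    is_alias, ["Country", "Name", "NameWoDiacritics"]
].drop_duplicates()

print(alias_df)
```

If the duplicates come from the same alias row being present in more than one source file, `drop_duplicates()` removes them either way.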

gradedSystem commented 1 month ago

@sabas what if we just do something like this (using a simple regex):

GL,Christianshaab, Qasigiannguit

wdyt?
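
A rough sketch of what that split could look like in Python. The pattern and the `split_alias` helper are hypothetical, not something already in the repo. It assumes every alias has the form `old name = current name`, optionally followed by the old name repeated in parentheses.

```python
import re

# Hypothetical pattern: "<alias> = <name>" with an optional trailing "(...)".
ALIAS_PATTERN = re.compile(
    r"^\s*(?P<alias>.+?)\s*=\s*(?P<name>.+?)(?:\s*\((?P<repeat>[^()]*)\))?\s*$"
)


def split_alias(value):
    """Return (alias, name) for an alias string, or None if it has no '='."""
    match = ALIAS_PATTERN.match(value)
    if match is None:
        return None
    alias, name, repeat = match.group("alias", "name", "repeat")
    # Drop the parenthetical only when it just repeats the alias;
    # otherwise keep it as part of the name.
    if repeat is not None and repeat != alias:
        name = f"{name} ({repeat})"
    return alias, name


print(split_alias("Christianshaab = Qasigiannguit (Christianshaab)"))
```

Values without an `=` come back as `None`, so they can be reported rather than silently rewritten.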