datasets / un-locode

United Nations Codes for Trade and Transport Locations (UN/LOCODE) and Country Codes
https://datahub.io/core/un-locode

Add subdivision name to data/subdivision-codes.csv #23

Closed cristan closed 8 months ago

cristan commented 9 months ago

In the UN/LOCODE CSVs, you can see whether a subdivision is a state, a province, or something else. Unfortunately, that's not present in subdivision-codes.csv in this repo. Please add this.

sabas commented 9 months ago

Hi, where do you see it? In UNECE's release there are only three fields in the subdivision table...

cristan commented 9 months ago

https://unece.org/trade/cefact/UNLOCODE-Download > CSV > 2023-2 SubdivisionCodes.csv

The first entries in there:

"AD","02","Canillo","Parish"
"AD","03","Encamp","Parish"
"AD","04","La Massana","Parish"
"AD","05","Ordino","Parish"

On this project:

SUCountry,SUCode,SUName
AD,02,Canillo
AD,03,Encamp
AD,04,La Massana
AD,05,Ordino

Maybe you can combine the data? I couldn't for the life of me figure out the encoding of the CSVs in their zip files, but the subdivision type never has special characters, so getting that column from the CSV would work.

sabas commented 9 months ago

@cristan I chose to dump from the Access release to solve the encoding issue :) Strange that there's a difference between them... I'll try writing to the UN office to look into this.

cristan commented 9 months ago

Speaking about differences between the Access release and the CSV, there are more differences:

In your project, this entry is duplicated:

IN,JK,Jammu and Kashmīr
IN,JK,Jammu and Kashmīr

It isn't in the CSV version.

The same applies to CHT in MA. It is duplicated here:

MA,CHT,Chtouka-Ait Baha
MA,CHT,Chtouka-Ait Baha

Whereas it isn't in the official CSV. Will you pass this along or shall I report this at https://apps.unece.org/unlocode/ ?

sabas commented 9 months ago

@cristan thanks, I added a uniq pass to the CSV, so these duplicates are now removed (they are in the Access DB). I sent a link to this issue; when I get feedback I will update :-)

cristan commented 8 months ago

In case you're interested: another thing that is present in the CSV but not in this project is the reference entries. Some examples:

"=","AT","","Vienna = Wien","Vienna = Wien","",,"",,"","",""
"=","BE","","Antwerp = Antwerpen","Antwerp = Antwerpen","",,"",,"","",""
"=","CA","","Three Rivers = Trois-Rivi�res","Three Rivers = Trois-Rivieres","",,"",,"","",""
"=","CH","","Geneva = Gen�ve","Geneva = Geneve","",,"",,"","",""

There are fewer than 100 of them and I'm not sure how many people care about these, but it is a difference.

sabas commented 8 months ago

Hi @cristan, for the reference entries: I knowingly dropped those to have a clean dataset... Perhaps we can add them as an additional CSV?

Some days ago I asked about the Access database issue during the UN/LOCODE maintenance meeting, and they confirmed there's an issue in the current application... We can probably switch the script to use the original CSV file for the conversion.

I will probably take a look at this during the week!

cristan commented 8 months ago

The completionist in me would indeed like an additional CSV. It makes it clearer that this actually exists in UN/LOCODE; I had no idea until recently. Still, I don't think I'd actually use it, so feel free to omit it if you want.

And excellent! It would be nice if I could just use this repo instead of having to script getting the subdivision name from the CSV.

sabas commented 8 months ago

@cristan I committed the updated dataset if you want to check!

cristan commented 8 months ago

Bad luck: I'm afraid this change made things much, much worse. Some entries have lost their diacritics:

AF,BAM,Bamyan,Province
AF,BDG,Badghis,Province

Others are just wrong:

AZ,DAS,Dask?s?n,Rayon
AZ,FUZ,Füzuli,Rayon
AZ,GA,G?nc?,Municipality
AZ,GAD,G?d?b?y,Rayon

But it's not just the subdivision codes that are wrong now; UN/LOCODE entries are wrong too:

,LV,MPS,Mālpils,Mālpils,RIX,RL,--3-----,1301,,5700N 02457E,

has become

,LV,MPS,Malpils,Malpils,RIX,RL,--3-----,1301,,5700N 02457E,

And yes, Mālpils is how you write it.

I doubt whether reading the CSVs is even possible. I even tried out 141 different encodings for you. I created this script:

#!/bin/bash

# Path to the UN/LOCODE subdivision CSV file
csv_file="2023-2 SubdivisionCodes.csv"

# Directory to store converted files
output_dir="converted_files"

# Create output directory if it doesn't exist
mkdir -p "$output_dir"

# List of supported encodings
encodings=$(iconv -l)

# Loop through each encoding and attempt conversion
for encoding in $encodings; do
    echo "Trying encoding: $encoding"
    output_file="${output_dir}/unlocode_${encoding}.csv"
    if iconv -f "$encoding" -t UTF-8 "$csv_file" > "$output_file" 2>/dev/null; then
        echo "Conversion successful with encoding: $encoding"
        echo "Output saved to: $output_file"
    fi
done

Then I ran this:

grep -Rnw . -e Z̧aby

(to find this line):

AE,AZ,Abū Z̧aby [Abu Dhabi],Emirate

No results. I even tried adding the old file, from before the revision, to the directory. Then I could find it, so the grep does work.

So either they're using a super obscure encoding, they've created their own custom encoding (possibly because of a bug in a conversion script where the wrong source encoding was selected, or something similar), or they're doing some other voodoo, like having fewer diacritics in the CSV than in the MDB.

sabas commented 8 months ago

I assumed Windows encoding but I didn't spot-check in Excel... What do you suggest?

cristan commented 8 months ago

Revert to reading from the MDB 😕. As I said, I checked 141 encodings and none of them work, so I think the encoding of the CSV is just broken.

I've run into this before: when you convert from one encoding to another but set the source encoding wrong, you basically invent a new encoding which is almost impossible to read back. I suspect something like this has happened at UN/LOCODE.
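A minimal illustration of this "invented encoding" effect, using Mālpils from the example above (this is only a demonstration of the general failure mode; the actual wrong step taken at UNECE is unknown):

```python
# What happens when UTF-8 bytes are decoded with the wrong source encoding.
text = "Mālpils"
utf8_bytes = text.encode("utf-8")  # b'M\xc4\x81lpils'

# Wrongly assume the bytes are Latin-1: every byte still "decodes",
# but the result is mojibake that no standard decode will fix directly.
garbled = utf8_bytes.decode("latin-1")  # 'MÄ\x81lpils'

# The damage is only reversible if you know the exact wrong step taken:
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)  # Mālpils
```

If such a mis-decoded string is then re-encoded and shipped, downstream consumers see what looks like a brand-new encoding, which is consistent with all 141 `iconv` attempts failing.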

If you still want to add the region types to this project, you could combine the region types from the CSV with the region data from the MDB. The region types don't have special characters, so this would work, though it's not the neatest of solutions.
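A rough sketch of that combination with pandas (the inline strings stand in for the real inputs — the repo's MDB-derived dump and the official SubdivisionCodes.csv — and the column names are illustrative):

```python
import io

import pandas as pd

# Stand-ins for the two sources: the MDB dump has clean UTF-8 names but no
# type column; the official CSV has broken names but usable ASCII types.
mdb_csv = "AD,02,Canillo\nAD,03,Encamp\n"
official_csv = '"AD","02","Canillo","Parish"\n"AD","03","Encamp","Parish"\n'

# keep_default_na=False stops pandas from turning the country code "NA"
# (Namibia) into a missing value; dtype=str preserves codes like "02".
mdb = pd.read_csv(io.StringIO(mdb_csv), header=None,
                  names=["SUCountry", "SUCode", "SUName"],
                  dtype=str, keep_default_na=False)
official = pd.read_csv(io.StringIO(official_csv), header=None,
                       names=["SUCountry", "SUCode", "Name", "Type"],
                       dtype=str, keep_default_na=False)

# Take names from the MDB side, types from the CSV side, joined on the keys.
merged = mdb.merge(official[["SUCountry", "SUCode", "Type"]],
                   on=["SUCountry", "SUCode"], how="left")
print(merged.loc[0, "Type"])  # Parish
```

On the real official CSV you would read the file with a lenient byte encoding such as `latin-1`, since only the ASCII type column is kept from it anyway.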

sabas commented 8 months ago

grouchy_smurf.jpg

I did a patchwork: now I take the MDB release and merge in the subdivision data from the CSV. Looks better?

cristan commented 8 months ago

Yeah, it's a pretty sucky situation with the encodings and the CSV. Anyway, there's still more work to do:

First of all: there's a problem with the first line:

AD,02,Canillo,

This should be a parish:

AD,02,Canillo,Parish

Then we have the weird situation that country codes / subdivision codes have disappeared, compared to before you started involving the CSV:

Now:

AZ,,Naftalan,Municipality
ES,,Navarra / Nafarroa,Province
IQ,,An Najaf,Governorate
IT,,Napoli,Metropolitan City
LB,,Nabatîyé,Governorate
,CA,Zambezi,Region
,ER,Erongo,Region
,HA,Hardap,Region
,KA,//Karas,Region
,KE,Kavango East,Region
,KH,Khomas,Region
,KU,Kunene,Region
,KW,Kavango West,
,OD,Otjozondjupa,Region
,OH,Omaheke,Region
,ON,Oshana,Region
,OS,Omusati,Region
,OT,Oshikoto,Region
,OW,Ohangwena,Region
NG,,Nasarawa,State

Before:

AZ,NA,Naftalan
ES,NA,Navarra / Nafarroa
IQ,NA,An Najaf
IT,NA,Napoli
LB,NA,Nabatîyé
NA,CA,Zambezi
NA,KA,//Karas
NA,KE,Kavango East
NA,KH,Khomas
NA,KU,Kunene
NA,KW,Kavango West
NA,OD,Otjozondjupa
NA,OH,Omaheke
NA,ON,Oshana
NA,OS,Omusati
NA,OT,Oshikoto
NA,OW,Ohangwena
NG,NA,Nasarawa

Lastly, there are duplicates again:

IN,JK,Jammu and Kashmīr,Union territory
IN,JK,Jammu and Kashmīr,Union territory
MA,CHT,Chtouka-Ait Baha,Province
MA,CHT,Chtouka-Ait Baha,Province
MA,KES, El Kelâa des Sraghna,Province
MA,KES, El Kelâa des Sraghna,Province
MA,KES,Kelaat Sraghna,Province
MA,KES,Kelaat Sraghna,Province
MK,205,Karbinci,Municipality
MK,205,Karbinci,Municipality

cristan commented 8 months ago

I also thought there was an inconsistency with the usage of quotes:

FR,69M,Métropole de Lyon,"Metropolitan collectivity with
special status"
GB,VGL,"Vale of Glamorgan, The [Bro Morgannwg GB-BMG]",Unitary authority

But that's caused by the fact that some types have a newline in them :P

sabas commented 8 months ago

The newlines were already fixed, for Canillo the line to be fixed is df = pd.read_csv('tmpsub.csv', encoding='cp1252', dtype=str, header=None)

merged_df = merged_df.drop_duplicates() for duplicates

For missing data, it's pandas helpfully converting "NA" to null.....................
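For anyone hitting the same pitfall: by default `pd.read_csv` treats the literal string "NA" (Namibia's country code, and a valid subdivision code elsewhere) as a missing value. A minimal illustration, with `keep_default_na=False` as the standard pandas switch to disable that:

```python
import io

import pandas as pd

# Two rows where "NA" is real data: Namibia as country, NA as subdivision.
csv = "NA,CA,Zambezi,Region\nAZ,NA,Naftalan,Municipality\n"

# Default behaviour: "NA" is parsed as missing (NaN) and later written
# out as an empty field, which is exactly the data loss seen above.
df_bad = pd.read_csv(io.StringIO(csv), header=None, dtype=str)

# Fix: turn off the default NA-token list so "NA" survives as a string.
df_ok = pd.read_csv(io.StringIO(csv), header=None, dtype=str,
                    keep_default_na=False)
print(df_ok.iloc[0, 0])  # NA
```

The same flag also explains why both the country column ("NA,CA,Zambezi") and the subdivision column ("AZ,NA,Naftalan") went empty at the same time.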

EDIT: Pushed the fix

cristan commented 8 months ago

Double-checked it, and it is exactly right. Thank you!

rufuspollock commented 8 months ago

@cristan @sabas great work 👏

Big props to @sabas for all the maintenance here!