datasets / un-locode

United Nations Codes for Trade and Transport Locations (UN/LOCODE) and Country Codes
https://datahub.io/core/un-locode

Almost impossible to get a reliable location for a unlocode #25

Closed cristan closed 5 months ago

cristan commented 6 months ago

The problem

The actual problem I want to solve is to figure out which unlocodes are within 250 km of a given location, and it seems like an almost impossible one. Basically the same as this Stack Overflow question. (For posterity: all given examples apply to UN/LOCODE 2023.2.)
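For reference, the radius query itself is the easy part once coordinates are trustworthy. A minimal sketch, assuming coordinates are already in decimal degrees and that missing ones are represented as None (the function names and the sample codes in the usage note are made up for illustration):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a 250 km filter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(origin, locations, radius_km=250.0):
    """Return the codes whose coordinates lie within radius_km of origin.

    `locations` maps a code to (lat, lon), or to None when the entry has
    no coordinates; those entries are skipped, not guessed.
    """
    lat0, lon0 = origin
    return [
        code
        for code, coord in locations.items()
        if coord is not None and haversine_km(lat0, lon0, *coord) <= radius_km
    ]
```

Usage would look like `within_radius((51.92, 4.48), {"XXAAA": (52.37, 4.90), "XXBBB": None})`. The hard part described below is getting coordinates worth feeding into it.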

Why is that hard? There's a location field in the unlocodes, right? Nope, over 20% of locations don't have coordinates.

Ah, that's probably just the really small towns, right? Nope, even San Francisco (USSFO) doesn't have GPS coordinates.

Ok, but the coordinates that are there are correct, right? Nope, there are well over a hundred locations where North and South and/or East and West are swapped, like NLBDV.

That's fairly easy to detect with a script. If you correct that, is it good now? Nope, you have cases with other typos, like US5WS, where the coordinates (3058N 07616W) should be 3959N 07617W.
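To illustrate what such a detection script could look like: a rough sketch, assuming the coordinate strings use the degrees-and-minutes form seen above (e.g. 3959N 07617W) and that you have some rough bounding box per country. The bounding box in the usage note is an approximate, hand-picked value, and the function names are hypothetical. This only catches hemisphere swaps, not digit typos like the US5WS one.

```python
import re

# 'DDMM' + N/S, whitespace, 'DDDMM' + E/W, e.g. "3959N 07617W"
_COORD_RE = re.compile(r"^(\d{2})(\d{2})([NS])\s+(\d{3})(\d{2})([EW])$")


def parse_locode_coords(text):
    """Parse 'DDMMN DDDMMW' into (lat, lon) in decimal degrees, or None if malformed."""
    m = _COORD_RE.match(text.strip())
    if m is None:
        return None
    lat_d, lat_m, ns, lon_d, lon_m, ew = m.groups()
    lat = int(lat_d) + int(lat_m) / 60
    lon = int(lon_d) + int(lon_m) / 60
    return (lat if ns == "N" else -lat, lon if ew == "E" else -lon)


def suggest_flip(coord, bbox):
    """Return the first hemisphere variant of coord that falls inside bbox.

    bbox is (min_lat, max_lat, min_lon, max_lon). The original coord is tried
    first, so an already plausible coordinate is returned unchanged. Returns
    None when no sign combination lands inside the box.
    """
    lat, lon = coord
    min_lat, max_lat, min_lon, max_lon = bbox
    for cand in ((lat, lon), (-lat, lon), (lat, -lon), (-lat, -lon)):
        if min_lat <= cand[0] <= max_lat and min_lon <= cand[1] <= max_lon:
            return cand
    return None
```

For example, with an approximate Netherlands box of `(50.5, 53.7, 3.0, 7.5)`, a made-up entry parsed from `5200N 00500W` would come back as `(52.0, 5.0)`, i.e. East and West swapped.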

Alright, often the coordinates are plainly wrong (source). Somehow filter those out and you should be set, right? I wish. CNYTN (Yantian Pt) points to Yantianxiang, not Yantian. DEWDF points to Walldorf, Schmalkalden-Meiningen, not Walldorf, Baden-Württemberg.

Other data sources

Because the locations in unlocode are so bad, I've started looking at other data sources.

IATAs

There are only 722 unlocodes with an explicit IATA. In other cases, it's unclear. You could cross check with an IATA data source to see if the city name matches or something, but this is quite a bit of work and only gives you coordinates of a small percentage of locations.

AtoBviaC

This is actually very promising. For example, CNYTN (noted above) has the correct coordinates here instead of the wrong ones in unlocode. It makes sense that the coordinates are good: these are actually used. Still, I found cases where I think the coordinates on this site are wrong: ARBEL, DCIN, ARCOO, UADNK, CAFRM, CNHAA, BRITA, BRMGU, BRSAN, RUVTZ. That's not too bad considering I found around 2600 unlocodes there, but the coverage is very limited, considering there are over 100000 unlocodes.

UN-Locode-with-Timezone (which uses GeoNames)

GeoNames looks good. If I search for CNYTN, I am redirected to the correct coordinates, rejoice!

But no, the "perfect matches" are actually terrible. This might be caused by the fact that the repo hasn't been updated in 4 years. Maybe it's better now, but the repo has been generated from some random serverless DB. If you want to do it properly, you'll need the GeoNames Premium Data Subscription, and I don't know if the data is good now, so I don't know if it's worth it.

The other ones (which don't match directly via unlocode, but via city & state) aren't bad, though there is still wrong data here.

Pay for the OpenCage API

Haha, no. If you try out their demo page and fill in LOCODE:CDBNW, you'll get the same north and south swapped coordinates as in the unlocode original.

Nominatim

Just search for the city, state and country in the Nominatim API.

https://nominatim.openstreetmap.org/search?format=jsonv2&accept-language=en&addressdetails=1&limit=10&city=White%20Rock&state=BC&country=CA for CAWHR

(omit the state when not specified or when there are 0 results)
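A small sketch of how that query and the state fallback could be wired up. The query parameters are the ones from the URL above; `fetch` is a hypothetical callable you supply (it takes a URL and returns the decoded JSON list), so nothing here actually touches the network.

```python
from urllib.parse import quote, urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_search_url(city, country, state=None, limit=10):
    """Build the structured search URL; the state parameter is left out when empty."""
    params = {
        "format": "jsonv2",
        "accept-language": "en",
        "addressdetails": 1,
        "limit": limit,
        "city": city,
    }
    if state:
        params["state"] = state
    params["country"] = country
    # quote_via=quote encodes spaces as %20 rather than '+'
    return NOMINATIM_SEARCH + "?" + urlencode(params, quote_via=quote)


def search_with_fallback(fetch, city, country, state=None):
    """Query with the state first; retry without it when that yields 0 results."""
    results = fetch(nominatim_search_url(city, country, state)) if state else []
    if not results:
        results = fetch(nominatim_search_url(city, country))
    return results
```

`nominatim_search_url("White Rock", "CA", state="BC")` reproduces the CAWHR URL shown above.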

This is very interesting, and usually yields much better results than the unlocode coordinates. But unfortunately not always. For example, MXECA has correct coordinates, but this Nominatim query first returns the wrong town, Carmen, before it returns the correct El Carmen as the second hit. The search for Chacabuco first returns a river (?) before returning the correct town. The only case I could find where Nominatim lacks the location entirely is BEHAW. The unlocode coordinates point to Ham-sur-Heure (which is probably fine, hard to tell with the pretty useless WAL region), but Nominatim just doesn't return that. There are probably more cases, but this is very rare.

But still, I consider Nominatim an excellent data source (better than unlocode). There is a limit on how often you can call it, so calling it for all 100000+ unlocodes at once isn't an option. It also means that you'd want caching if you want something like this running in production.
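A rough idea of what throttling plus caching could look like. The class name and the one-second default interval are assumptions on my part (check the current Nominatim usage policy for the real limit), and `fetch` is again a hypothetical callable that does the actual request. The cache here is an in-memory dict; for production you would want it persisted somewhere.

```python
import time


class ThrottledCache:
    """Cache results per key and space out calls to the underlying fetch."""

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between real requests (assumed)
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_call = None

    def get(self, key):
        # Cached answers never count against the rate limit.
        if key in self._cache:
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        value = self._fetch(key)
        self._cache[key] = value
        return value
```

The clock and sleep functions are injectable only so the throttling can be tested without really waiting.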

How to get a reliable location

I still have to think about this one, but I think multiple data sources have to be involved. For example: call Nominatim and check whether any of the results is within 100 km of the coordinates in the unlocode database. If yes, take that one (and cache it, so you don't have to call Nominatim for this again); if no, flag it for a manual check. This is quite a bit of work to implement and maintain, but I don't know of another way. Anything which might help me here is welcome.
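Sketched out, that cross-check could be as small as this. It assumes the Nominatim results are the jsonv2 list, where each hit carries `lat` and `lon` as strings; the function names are made up, and returning None is just one way to represent "flag for a manual check".

```python
import math


def distance_km(a, b):
    """Haversine distance in km between two (lat, lon) pairs in decimal degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def reconcile(locode_coord, nominatim_results, max_km=100.0):
    """Return (lat, lon) of the first Nominatim hit within max_km of the unlocode
    coordinates, or None when nothing agrees and a manual check is needed."""
    if locode_coord is None:
        return None  # nothing to cross-check against
    for hit in nominatim_results:
        candidate = (float(hit["lat"]), float(hit["lon"]))
        if distance_km(locode_coord, candidate) <= max_km:
            return candidate
    return None
```

Going through all hits rather than only the first one is what would rescue a case like MXECA, where the correct town is the second result.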

Super edge cases

There are also hard / weird cases like GREFP, where both the coordinates and the name are wrong (should be Efpalio without the n). Or SKJAO, which has wrong coordinates, but if you correct them, you end up at SKJSV, so SKJAO should probably be deleted.

Reporting these mistakes

I bet you're wondering whether I've reported all the mistakes I found. And the answer is yes 😁. I haven't heard back at all, and I reported a lot (around 300), so I hope they are able to look at everything I reported. Maybe I'll have time to finalise the scripts so I can report all of the coordinates that are off by more than 1000 km before the next cut-off date, but I should already have reported most of them, which would already be a huge improvement.

Lastly, feel free to close this issue after you have given your 2 cents, because I wouldn't be surprised if this is out of scope for this project.

sabas commented 6 months ago

@cristan can you please forward me the email with the report you made? I'll try to discuss it at the maintenance meeting this Thursday if I can join it... :)

cristan commented 6 months ago

I now basically fixed the problem myself :D. See https://github.com/cristan/improved-un-locodes. It has both corrected coordinates and way more of them (96.5% of UN/LOCODEs have coordinates here). This percentage can even go up with Wikidata and GeoNames integration, though I imagine it won't go up by much.

It also contains many scripts to analyse the UN/LOCODE dataset and report various issues (most of the code isn't too pretty, but it works).

sabas commented 5 months ago

Solved by Cristan's process

HC-47 commented 1 month ago

I don't understand. Has the solution by Cristan been included in the data, or should we implement it ourselves? Just to know if I shall try to fix the missing data or I can surrender to them :D

sabas commented 1 month ago

@HC-47 use cristan's fork for complete data... The official dataset is dependent on the existing workflow so it cannot get all the "fixes" immediately

HC-47 commented 1 month ago

@sabas, ok thanks 👍