I don't think we can assess the quality up front (for an API or for a static source, really). The best we can do is make an educated guess, and then see how well it works for users in a season.
I've found two data sources: MaxMind and GeoNames. MaxMind has both free and paid versions of their cities data. The free version has about 120k entries, and that's what powers the admin dropdowns currently. GeoNames has data for cities with populations as small as 500, and that file has more like 200k entries.
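To make the size comparison less hand-wavy, here's a minimal sketch of loading the GeoNames extract and counting what we'd get, assuming the standard tab-separated cities500.txt layout from the GeoNames readme (name in column 1, country code in 8, admin1 code in 10, population in 14). Nothing here is wired into the platform; it's just how I'd poke at the file:

```python
# Sketch: load GeoNames cities500.txt and see how many dropdown entries we'd get.
# Column indices follow the layout described in the GeoNames readme:
#   0 geonameid, 1 name, 8 country code, 10 admin1 code, 14 population
from collections import Counter

def load_cities(path="cities500.txt"):
    cities = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            cities.append({
                "name": cols[1],
                "country": cols[8],
                "admin1": cols[10],
                "population": int(cols[14] or 0),
            })
    return cities

if __name__ == "__main__":
    cities = load_cities()
    print(f"{len(cities)} city entries")  # roughly 200k in cities500
    print(Counter(c["country"] for c in cities).most_common(10))
```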
Anecdotally, the GeoNames data has St. Clements while the MaxMind free data doesn't, but that's incredibly weak evidence for anything. I just don't see how we can evaluate it ahead of time.
GeoNames offers paid premium data as well. Interestingly, it looks like a static data source won't guarantee consistency on its own; GeoNames pitches consistency checking as part of the premium offering:
Applications using the Premium Data can reduce their own consistency checks and let the GeoNames team do this job.
Also, we need to think not just about building this once, but about the updates and maintenance needed to keep the locations current. Again, the paid data provides some help there, but we will still have extra work to do:
Major data updates are described in the documentation to make sure applications can react accordingly and take the necessary steps. Example : major city changing the main name like Bombay being renamed to Mumbai.
An alternative to jumping to a static data source may be to work on monitoring and alerting for our current locations code, to get better insight into how it's failing.
I'm trying to do a rough analysis now of how many of our current users wouldn't be able to select their location from the list if we backed it with GeoNames data, and what their current locations are. So far, some of them definitely have garbage data for locations. Others are legitimate locations with fewer than 500 people in the city. Some appear to be valid variations for expressing a state or province.
I'm down a rabbit hole here, but this is what I'm currently getting from my analysis:
Fail match on country: 5313 users, 19 locations
Fail match on state/province: 62664 users, 2285 locations
Fail match on city: 4994 users, 871 locations
State/province is overwhelmingly where the failure to match against GeoNames occurs, and that makes sense: the Google API doesn't guarantee what format it uses for that field, while GeoNames codes admin1-level regions mostly in FIPS, with a bit of ISO thrown in for certain countries. (Taking our data as-is gives even worse results, more like 80k not matched on state/province; attempting to at least translate ISO to FIPS where possible brings it down to 60k.) Google's own docs say:
In most cases, administrative_area_level_1 short names will closely match ISO 3166-2 subdivisions and other widely circulated lists; however this is not guaranteed as our geocoding results are based on a variety of signals and location data.
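For concreteness, this is roughly the shape of the state/province check, sketched in Python. It assumes the GeoNames admin1CodesASCII.txt download (tab-separated: a "COUNTRY.ADMIN1" code, name, ASCII name, geonameid) plus a hypothetical users collection holding whatever country/state strings we currently store; matching on the admin1 name as a fallback is my workaround for the FIPS-vs-ISO mismatch, and the helper names are made up for illustration:

```python
# Sketch: check whether our stored state/province strings resolve against
# GeoNames admin1 regions. admin1CodesASCII.txt rows look like:
#   US.TX<TAB>Texas<TAB>Texas<TAB>4736286
def load_admin1(path="admin1CodesASCII.txt"):
    by_code = {}   # ("US", "TX") -> "Texas"
    by_name = {}   # ("US", "texas") -> "TX"
    with open(path, encoding="utf-8") as f:
        for line in f:
            code, name, ascii_name, _geonameid = line.rstrip("\n").split("\t")
            country, admin1 = code.split(".", 1)
            by_code[(country, admin1)] = name
            by_name[(country, ascii_name.lower())] = admin1
    return by_code, by_name

def matches_admin1(country, state, by_code, by_name):
    """True if a user's state string resolves to a GeoNames admin1 region.

    `state` might be a code ("TX"), a full name ("Texas"), or junk; accept
    either a direct code hit or a case-insensitive name hit.
    """
    state = (state or "").strip()
    return (country, state.upper()) in by_code or (country, state.lower()) in by_name

# Hypothetical usage against our user records (record shape assumed):
# unmatched = [u for u in users
#              if not matches_admin1(u["country_code"], u["state"], by_code, by_name)]
```

The point of the name fallback is to sidestep the FIPS/ISO code mismatch rather than maintain a translation table; in this sketch, anything that misses both lookups would land in the state/province fail bucket.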
So what I think this all means is that we can't really get a great bead on how well a static data source would work for us, because our data is so dirty and I haven't yet thought of a realistic way to clean it up.
I did take a look at some of those 5k city fails though, and you can find cities in the list that do exist in Google Maps and do indeed have populations of under 500 people. So those people would not be able to select their city from a list built off the GeoNames data; we would have to ask them to select the closest city to them, or something like that.
That in turn suggests to me that we might consider a hybrid model where country and state/province would be selected from fixed lists, and the city could be entered manually by the user and checked against an API through geocoder to make sure it's understandable and to get a lat/long (see the sketch below). That way our data is clean at the upper levels, which we use to do things like group users into regions, while at the city level it might be messier (the same city expressed multiple ways) but still gives people full expressive power for telling us where they live.
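As a rough sketch of the city half of that hybrid: take the free-text city plus the country and state/province already picked from the fixed lists, run it through a geocoding backend, and keep the normalized name and lat/long. geopy's Nominatim is standing in here for whatever we'd actually call through geocoder, and the function and field names are illustrative, not a spec:

```python
# Sketch: validate a user-typed city against a geocoding API, given the
# country and state/province already chosen from fixed lists.
# geopy/Nominatim is just a stand-in backend for illustration.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="location-prototype")  # hypothetical app name

def validate_city(city_text, state, country_code):
    """Return (display_name, lat, lon) if the city geocodes, else None."""
    query = f"{city_text}, {state}, {country_code}"
    result = geolocator.geocode(query, exactly_one=True)
    if result is None:
        return None  # ask the user to try the nearest larger city instead
    return result.address, result.latitude, result.longitude

# e.g. validate_city("St. Clements", "Manitoba", "CA") would give us a
# canonical address string plus a lat/long we can group into regions.
```

The nice property is that the country/state inputs stay constrained, so regional grouping keeps working, while the city string only has to geocode successfully rather than match a fixed list.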
Oh, I also wanted to mention that at this point I think we would be talking about prototyping such a system independently of the platform. Having had my hands back in this code for a few days now, I'm not confident that pulling out the old stuff to replace it is going to be an easy task.
Collecting some more thoughts here: https://docs.google.com/document/d/1-PT21KBr5eanJaQ0jC7Vt-E3yRY51-3IwTp5gMchIuo/edit#heading=h.nvlsbdi4wnbx
Just acknowledging I read through this. 👍
I've read it through. For reference, here's the MaxMind page describing the paid tiers, which go down to the city level: https://www.maxmind.com/en/geoip2-databases
From the #2536 spike on location data process:
Investigate and identify options for using a static data source for geographical locations associated with users.
(Context/concern: may degrade the UX)
How to assess quality? Granularity of data model? It'd be hard to match vs our current list due to a large number of duplicate entries expressed differently (e.g. Spain vs ES vs ...)
What are the options for user input / how would we implement it?