foursquare / fsqio

A monorepo that holds all of Foursquare's open source projects
Apache License 2.0

"Cambridge, Worcester County, MA" has woeType of 7 #6

Closed by steveha-ziprecruiter 8 years ago

steveha-ziprecruiter commented 8 years ago

I have found a neighborhood marked with a woeType of 7 (TOWN), and it is causing a quirk in the displayName returned for an actual town.

Here is a Twofishes query for Cambridge with the location hint set to a point in Cambridge, MA (42.36565, -71.10832), requesting 20 interpretations:

http://demo.twofishes.net/static/geocoder.html?query=Cambridge&ll=42.36565,-71.10832&maxInterpretations=20
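For anyone who wants to reproduce the two queries in this report with different values, here is a minimal sketch that rebuilds the demo-page URL from its three parameters (query, ll, maxInterpretations). The helper name build_demo_url is hypothetical; only the base URL and the parameter names come from the link above.

```python
from urllib.parse import urlencode

# Base URL copied from the demo link in this report.
DEMO_PAGE = "http://demo.twofishes.net/static/geocoder.html"


def build_demo_url(query, lat, lng, max_interpretations=None):
    """Build a demo-page URL (hypothetical helper, not part of Twofishes)."""
    params = {"query": query, "ll": f"{lat},{lng}"}
    if max_interpretations is not None:
        params["maxInterpretations"] = max_interpretations
    # safe="," leaves the comma in the ll value unescaped, as in the link above.
    return f"{DEMO_PAGE}?{urlencode(params, safe=',')}"


print(build_demo_url("Cambridge", 42.36565, -71.10832, 20))
```

Leaving out max_interpretations gives the single-interpretation form of the query.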

Interpretation 1 is the famous Cambridge, MA. As it should, it has woeType set to 7 (TOWN). However, it is shown with a displayName of Cambridge, Middlesex County, MA.

Interpretation 14 is Cambridge, Worcester County, MA. This also has woeType set to 7 (TOWN) which I believe is incorrect. The source is qs_neighborhoods.shp and I believe the woeType should be set to 22 (SUBURB).

Wikipedia shows a neighborhood of Worcester, MA called "Cambridge Street":

https://en.wikipedia.org/wiki/Neighborhoods_of_Worcester,_Massachusetts

Perhaps it would be best if the name of the neighborhood were changed to "Cambridge Street" rather than just "Cambridge"? But if I am not mistaken, that would be an issue to file on Quattroshapes and not on Twofishes.

If there is only one interpretation, the displayName is Cambridge, MA as expected:

http://demo.twofishes.net/static/geocoder.html?query=Cambridge&ll=42.36565,-71.10832

Therefore it seems plausible that Twofishes gives the unusual displayName of Cambridge, Middlesex County, MA only when it returns two distinct cities with the same name in the same state, and that it adds the county to disambiguate. I am hoping that if the woeType of the "Cambridge Street" neighborhood is properly set to 22 (SUBURB), my users will consistently get Cambridge, MA as the displayName no matter how many interpretations are requested for a city called "Cambridge".
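To make the hypothesis concrete, here is a toy sketch of the behavior I am guessing at. It is not the Twofishes implementation; the Place type and the display_names function are made up for illustration, and only the woeType values 7 and 22 and the place names come from this report.

```python
from collections import Counter
from typing import NamedTuple

# woeType values quoted in this report.
TOWN = 7
SUBURB = 22


class Place(NamedTuple):
    """Hypothetical record; not a Twofishes type."""
    name: str
    county: str
    state: str
    woe_type: int


def display_names(places):
    """Add the county only when two TOWNs share a name and a state."""
    clashes = Counter((p.name, p.state) for p in places if p.woe_type == TOWN)
    names = []
    for p in places:
        if p.woe_type == TOWN and clashes[(p.name, p.state)] > 1:
            names.append(f"{p.name}, {p.county}, {p.state}")
        else:
            names.append(f"{p.name}, {p.state}")
    return names


middlesex = Place("Cambridge", "Middlesex County", "MA", TOWN)
worcester_as_town = Place("Cambridge", "Worcester County", "MA", TOWN)
worcester_as_suburb = worcester_as_town._replace(woe_type=SUBURB)

# Two TOWNs with the same name and state: the county is added.
print(display_names([middlesex, worcester_as_town])[0])
# The second place reclassified as SUBURB: no clash, so no county.
print(display_names([middlesex, worcester_as_suburb])[0])
```

In this toy model the first call prints Cambridge, Middlesex County, MA and the second prints Cambridge, MA, which is the outcome I am hoping for once the neighborhood is reclassified.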

Are all the places from qs_neighborhoods.shp being loaded with woeType of 7 (TOWN)? If so, this could cause multiple related quirks similar to this one.

rahulpratapm commented 8 years ago

Hi Steve,

Yes, you're right that we add county only to disambiguate (here, if you're curious).

The problem here seems to be that Geonames thinks there's a PPL (populated place) in Worcester County called Cambridge. I don't see evidence that this town exists but I'm a little wary of making that call and editing Geonames. You're welcome to edit or delete it if you're absolutely confident, though. If you choose to edit, changing its name should be straightforward. To make it a neighborhood, change its place code to PPLX.
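As a rough illustration of why the place code matters, the sketch below pairs the two GeoNames feature codes mentioned here with the woeTypes discussed in this thread. The lookup table and the function name are hypothetical and inferred only from this discussion; the actual mapping in the index build covers many more codes.

```python
# Hypothetical lookup, inferred from this thread only:
# PPL (populated place) ends up as TOWN (7), while PPLX
# (section of populated place) is treated as a neighborhood, SUBURB (22).
FEATURE_CODE_TO_WOE_TYPE = {
    "PPL": ("TOWN", 7),
    "PPLX": ("SUBURB", 22),
}


def woe_type_for(feature_code):
    """Return (name, number) for a feature code, or None if it is not listed."""
    return FEATURE_CODE_TO_WOE_TYPE.get(feature_code)


print(woe_type_for("PPL"))
print(woe_type_for("PPLX"))
```

Under this reading, a neighborhood tagged PPL in Geonames would be indexed as a town, which matches what you are seeing.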

The source you cited is just the source of the polygon we matched this feature to, and that matching is permissive (or desperate) enough to allow neighborhoods to match cities (this is frequently required in other parts of the world). I would imagine most of the neighborhoods in qs_neighborhoods match neighborhoods rather than towns. In either case, this too is something you have control over during the index build (see here).

Rahul.

steveha-ziprecruiter commented 8 years ago

I've never been to Massachusetts. I'm going by Wikipedia which says that "Cambridge Street" is a neighborhood. Here's the URL again:

https://en.wikipedia.org/wiki/Neighborhoods_of_Worcester,_Massachusetts

The polygon is obviously drawn around Cambridge Street if you look at the map.

I decided to check one I do know about. In Seattle there is a neighborhood called the "University District". I looked it up, and sure enough, (a) Twofishes knows about it, (b) the lat/long is in the right place, and (c) it has woeType of 7 (TOWN) which is just wrong. Checking GeoNames, it's tagged with "PPL".

http://www.geonames.org/7153937/university-district.html

I think this seems like a general problem, but I have no idea how to solve it generally.

I'm not sure what you meant by "this too is something you have control over during the index build (See here)". I know I can simply omit the qs_neighborhoods.shp shapefile from the index build; is that what you meant? If I leave that in, the index build will match the contents of the shapefile against the compiled GeoNames information, and if "University District" is tagged with woeType of 7 (TOWN) then that's how it would be included. So, to my understanding, I can omit neighborhoods entirely, and maybe I can patch them with "hotfixes", but I don't know how to import them all with a different woeType.

rahulpratapm commented 8 years ago

I meant that you can add a .mapping.json file to specify which woeTypes a particular polygon file should be allowed to match. For the purpose of forward geocoding, though, this has no effect as long as the wrongly classified features exist in Geonames.

Going by what you describe, it does seem like a bigger issue with Geonames data. We've never run into it internally at Foursquare since we license proprietary neighborhood data. The best thing to do is bring this up on the forums at geonames.org or maybe email Marc Wick directly (marc@geonames.org) if this is particularly widespread.

steveha-ziprecruiter commented 8 years ago

We might as well close this ticket. It's more properly raised with the GeoNames.org project.

By the way, I contributed an edit to GeoNames that changes the name of the neighborhood to "Cambridge Street" and properly marks it as a neighborhood and not a full city. New builds of the Twofishes indexes don't have the problem of "Cambridge, MA" not being a unique city name; the ambiguity has been resolved.