OpenHistoricalMap / issues

File your issues here, regardless of repo until we get all our repos squared away; we don't want to miss anything.
Creative Commons Zero v1.0 Universal

Nominatim search for "san marino" returning no results when it should #407

Closed danrademacher closed 2 years ago

danrademacher commented 2 years ago

Bug description: This search for San Marino has no results: https://openhistoricalmap.org/search?query=san%20marino#map=7/44.153/-120.515&layers=O&date=1859&daterange=1859,2022

San Marino is quite well mapped: https://www.openhistoricalmap.org/#map=13/43.9399/12.4725&layers=O&date=1900&daterange=1800,2021

Here's a relation with "San Marino" in the name: https://www.openhistoricalmap.org/relation/2692735#map=13/43.9399/12.4510&layers=OND&date=301&daterange=301,2021

We would expect that to be found when searching for "san marino".

This works though:

https://www.openhistoricalmap.org/search?query=San%20Marino%201463-present#map=13/43.9426/12.4594&layers=OND&date=301&daterange=301,2021

This does not: https://www.openhistoricalmap.org/search?query=San%20Marino#map=13/43.9426/12.4594&layers=OND&date=301&daterange=301,2021

danrademacher commented 2 years ago

This is not a simple case of the API not responding that we could try to address on the infrastructure side.

It also does not seem to be some weird bug related to "years in name make substring fail" since this works: https://staging.openhistoricalmap.org/search?query=comancheria#map=7/34.746/-100.629&layers=O&date=1922-01-01&daterange=1922-01-01,2022-12-31

So then what is it about "San Marino 1463-present" that causes a search for "San Marino" to return no results?

danrademacher commented 2 years ago

This is also true of Oregon:

This one has 4 results: https://www.openhistoricalmap.org/search?query=Oregon%20Territory

But this has only one: https://www.openhistoricalmap.org/search?query=Oregon

At least part of the issue seems to be that it is not doing substring search.

danrademacher commented 2 years ago

@batpad who do we know in the larger OSM community who we could reach out to on this one?

It seems clear that the data is in Nominatim's database, so not an issue of data syncing between main DB and Nominatim, but one can only get those results by typing in the exact name to get a result.

Look at OSM for "oregon" and you get a lot more results: https://www.openstreetmap.org/search?query=oregon#map=13/41.6782/-83.4387

It appears to be doing substring search on, e.g., Oregon County, but I wonder if that's really true or if all these items have other searchable tags that are just "oregon".

E.g., here's "Oregon County" with an "alt name" of "Oregon": https://www.openstreetmap.org/relation/1180502

But this one seems like a substring: https://nominatim.openstreetmap.org/ui/details.html?osmtype=W&osmid=722118662&class=man_made

https://www.openstreetmap.org/way/722118662

The only visible instance of "Oregon" there is name:etymology:wikipedia | en:Hawthorne Boulevard (Portland, Oregon)

So the question is: how can we get our instance of Nominatim to treat name | San Marino 1463-present the same way OSM's Nominatim treats name:etymology:wikipedia | en:Hawthorne Boulevard (Portland, Oregon)?

danrademacher commented 2 years ago

I discussed this with @batpad and here are notes from that:

Potentially 3 things:

There's folks doing something similar to OSM-seed and they have a Pelias container packaged: https://github.com/headwaymaps/headway/tree/main/services/pelias - it seems like it might not be too much work to set it up and get it to read from our replication and just explore if that gives us better results / maybe more configurable, etc.

I'll check with @geohacker about an intro to Sarah Hoffmann. Since OSM is using Nominatim and getting better substring search results, this feels like something that might be solvable without actually getting into Nominatim internals. Though that won't tell us if Pelias might be a better long-term search solution...

geohacker commented 2 years ago

@danrademacher I added another point in this ticket https://github.com/OpenHistoricalMap/issues/issues/243#issuecomment-1156246662 about admin levels used to construct the display name. I'm not entirely sure it's related but just wanted to link here as we look into this.

geohacker commented 2 years ago

I'll try to tag @lonvia here to see if she might have any ideas. I mentioned to Sarah at SOTM that we are using Nominatim for OHM and have some quirks which we aren't entirely sure about.

To summarise, in this particular case:

  1. There's an admin level 2 relation https://www.openhistoricalmap.org/relation/2692735#map=13/43.9397/12.4509&layers=OND&date=1922-01-01&daterange=1922-01-01,2022-12-31 named San Marino 1463-present
  2. There's a place=country node named San Marino https://www.openhistoricalmap.org/node/2090640309#map=20/43.93771/12.46485&layers=OND&date=1463-06-27&daterange=1463-06-27,2022-12-31
  3. There's a place=city node named City of San Marino
  4. But searching san marino brings up no results.

I think we are most certainly misinterpreting Nominatim's behaviour and expecting something that our instance isn't configured properly for. @lonvia, would be great if you have any thoughts or directions you can point us in. Thank you!

lonvia commented 2 years ago

That's indeed a problem of partial name matching, i.e. Nominatim having trouble matching san + marino with "San Marino 1463-present". The good news is that much of this is improved in the latest 4.1 version. So updating your installation and reimporting the Nominatim database might already solve that particular issue. However, there is a fundamental problem with your names here, which you should look into.

Nominatim has a very heavy bias towards matching against the full strings in the name tag. That can't be changed without ending up with a lot of false positive results. Names like 'San Marino 1463-present', which contain multiple pieces of information, are really bad. There are three ways to solve the problem:

  1. Introduce the equivalent of start_date and end_date and move the extra information to separate tags. This is the cleanest solution but not always possible.
  2. Establish the convention that the date information must go into brackets: San Marino (1463-present). Nominatim's tokenizer already has a special handling of names with brackets like that and will assume that 'San Marino' is a full name in this case. To be precise, it will add the full names 'San Marino' and 'San Marino 1463-present'. Maybe exactly what you need, maybe not.
  3. Write your own tokenizer. That would be the really advanced version. The newest version of Nominatim allows you to preprocess the names before they are added into the search index, see this tutorial (part 'Write your own sanitizer'). So, whatever conventions you come up with for the name tag, you can write your own parser for that.
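
The bracket convention in option 2 can be illustrated with a small Python sketch. This is not Nominatim's tokenizer code. `name_variants` is a hypothetical helper that only mimics the behaviour described above, where a trailing bracketed term yields both the bare name and the name with the bracket contents appended.

```python
import re

# Hypothetical helper for illustration only; it is NOT part of Nominatim.
# Matches "<base> (<extra>)" where the bracketed term closes the name.
_BRACKET_SUFFIX = re.compile(r"^(?P<base>[^()]+?)\s*\((?P<extra>[^()]*)\)\s*$")


def name_variants(name: str) -> list[str]:
    """Return the full-name variants a bracket-aware indexer might store."""
    match = _BRACKET_SUFFIX.match(name)
    if match is None:
        # No trailing bracket: the whole string is the only full name.
        return [name.strip()]
    base = match.group("base").strip()
    extra = match.group("extra").strip()
    variants = [base]
    if extra:
        # Also keep the long form, with the bracket contents appended.
        variants.append(f"{base} {extra}")
    return variants


print(name_variants("San Marino (1463-present)"))
print(name_variants("San Marino 1463-present"))
```

Under this sketch, only the bracketed form produces a standalone 'San Marino' variant. The unbracketed name stays a single full string, which is consistent with the failing search described in this thread.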

The particular result should also be findable via the linked place name but I'd need the link to your Nominatim installation directly to check what is going on in the database. It's quite probable that this has been solved in the newest Nominatim version, too, with https://github.com/osm-search/Nominatim/pull/2637.

danrademacher commented 2 years ago

Thank you @lonvia for the quick and detailed response! This is great news -- so we can focus on (a) updating to latest Nominatim and (b) keeping dates out of names.

@jeffreyameyer for the dates in names, I think if we want those to appear in various places, we should add them programmatically from our existing start_date and end_date tags, for example appended to labels in map tiles at generation, or added to sidebar names via Rails or JavaScript code. That way the name stays what it is and dates get added where we want them consistently, without confusing the name field.
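
A minimal sketch of that idea, written in Python for brevity rather than in the Rails or JavaScript code that would actually render labels. The function names are hypothetical, and two behaviours are assumptions of this sketch rather than settled OHM behaviour: a missing end_date is shown as "present", and only the year of each date is displayed.

```python
def _year(date: str) -> str:
    """Return the year of a YYYY, YYYY-MM or YYYY-MM-DD string, without zero padding."""
    sign = "-" if date.startswith("-") else ""  # keep a leading minus for negative years
    digits = date.lstrip("-").split("-")[0]
    return sign + (digits.lstrip("0") or "0") if digits else ""


def label_with_dates(tags: dict[str, str]) -> str:
    """Build a display label from name, start_date and end_date without touching name."""
    name = tags.get("name", "")
    start = tags.get("start_date", "")
    end = tags.get("end_date", "")
    if not start and not end:
        return name  # nothing to append
    start_year = _year(start) or "?"
    # Assumption: no end_date means the feature still exists.
    end_year = _year(end) if end else "present"
    return f"{name} ({start_year}-{end_year})"


print(label_with_dates({"name": "San Marino", "start_date": "1463-06-27"}))
```

The stored name stays "San Marino", so search sees a clean name while the label still carries the dates.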

We'll still want to upgrade Nominatim to get best possible results, but this is a good outcome!

Rub21 commented 2 years ago

@danrademacher, I am going to work on updating Nominatim to the latest version!! Let's see how that works.

danrademacher commented 2 years ago

Excellent!

I went ahead with the brackets/parentheses change as the first and easiest thing: https://www.openhistoricalmap.org/changeset/42589#map=13/43.9427/12.4594&layers=OND&date=0301-09-03&daterange=0301-09-03,2022-12-31

But in our current version of Nominatim at least, that didn't make any difference: https://www.openhistoricalmap.org/search?query=San%20Marino#map=17/43.95403/12.40774&layers=O&date=1922-01-01&daterange=1922-01-01,2022-12-31

Also note that we recently got a request to add dates to feature labels in iD, https://github.com/OpenHistoricalMap/issues/issues/430. Not sure how something similar could be done in JOSM. That would help a lot in terms of showing dates without messing with the name value.

lonvia commented 2 years ago

The brackets should have worked even in the old Nominatim version. I see now that there is another problem with San Marino. It is a boundary on admin_level=2 aka country level. This is going to confuse Nominatim because countries are handled special. And Nominatim makes the assumption that there is exactly one country version of each country. You probably have more than one. At this point we should have a chat about how you expect the historical data to be handled by Nominatim when it is creating addresses. Depending on the answer we can see how to tweak your instance to handle the data right.

danrademacher commented 2 years ago

OK, thanks for the further look @lonvia. @jeffreyameyer we're going to need you to weigh in on this bigger issue of our expectations for Nominatim, especially around countries. It is definitely the case that most if not all countries will have many iterations of their borders over the centuries/millennia that could be covered in OHM. We might want to split this out into another ticket. I'm not quite sure, to be honest, how we clearly define our expectations of "how the historical data is to be handled by Nominatim when it is creating addresses", but Jeff and others have thought more deeply about such higher-level questions than I have!

Rub21 commented 2 years ago

The Nominatim container has been updated to 4.1!! But it does not show the San Marino results yet.

danrademacher commented 2 years ago

OK, thanks @Rub21! Based on @lonvia's last note, it sounds like we have additional issues to work through here about how Nominatim treats admin_level=2 boundaries (countries). I reassigned this ticket to @jeffreyameyer for follow-up on our larger expectations for how Nominatim should be working.

danrademacher commented 2 years ago

Further notes from team discussion

jeffreyameyer commented 2 years ago

@lonvia - first of all, THANK YOU SO MUCH for showing up to help us with our issues. Search is a critical functionality and we've been wrestling with a few issues for a while. Feels good to get your thoughts!

> ... This is going to confuse Nominatim because countries are handled special. And Nominatim makes the assumption that there is exactly one country version of each country. You probably have more than one. At this point we should have a chat about how you expect the historical data to be handled by Nominatim when it is creating addresses. Depending on the answer we can see how to tweak your instance to handle the data right.

Your guesses are spot on - we definitely need more than 1 version of a country. The US alone probably has 50+ versions. Where can we go to find out more about how Nominatim handles countries? Also, how does it handle states? I'm now realizing that the funny behavior we've seen where Seattle or other places end up as locations in "The Republic of Texas" may be the result of some address reconstruction.

From #374: image

> 1. Introduce the equivalent of start_date and end_date and move the extra information to separate tags. This is the cleanest solution but not always possible.

Any object (typically relations) with years in the name=* tag should already have start_date and end_date populated.

> 2. Establish the convention that the date information must go into brackets: San Marino (1463-present). Nominatim's tokenizer already has a special handling of names with brackets like that and will assume that 'San Marino' is a full name in this case. To be precise, it will add the full names 'San Marino' and 'San Marino 1463-present'. Maybe exactly what you need, maybe not.

This is totally reasonable. The dates are a critical piece of identifying / differentiating information that should always travel with the more common name string. I realize this sounds a bit odd and counter to data best practices, but we cannot guarantee what downstream forms the data will take. If there are some tools that parse only name and not the start and end date info, the resulting entries in that tool's table will be very difficult to differentiate.

In some ways, this is akin to our desire to put source=* tags on every object, as opposed to at the changeset level - the changeset source info is pretty inscrutable to the casual viewer looking at a single OSM object.

jeffreyameyer commented 2 years ago

@lonvia - that all said, what's the best way to get in touch to coordinate a meeting, if you're open to that? Maybe a quick chat to start - 20-30 min?

lonvia commented 2 years ago

> Your guesses are spot on - we definitely need more than 1 version of a country. The US alone probably has 50+ versions. Where can we go to find out more about how Nominatim handles countries? Also, how does it handle states? I'm now realizing that the funny behavior we've seen where Seattle or other places end up as locations in "The Republic of Texas" may be the result of some address reconstruction.

Exactly. The problem is not only with country levels but with all address parts. The difference is that with country names it breaks in a way that you can't even search anymore. For other address parts the display output just becomes funny and you might get some odd false positive results.

> @lonvia - that all said, what's the best way to get in touch to coordinate a meeting, if you're open to that? Maybe a quick chat to start - 20-30 min?

A chat would be the easiest way forward. Can you send me an email with some time suggestions? I'm in UTC+2.

jeffreyameyer commented 2 years ago

This particular issue appears to be fixed with the upgrade (AWESOME!), although I think we should open another one for the tokenizer or whatever will help with substring searches (e.g. "oregon" doesn't return "oregon territory").

New ticket for this: image