Closed by danrademacher 2 years ago
This is not a simple case of the API not responding that we could try to address on the infrastructure side.
It also does not seem to be some weird bug related to "years in name make substring fail" since this works: https://staging.openhistoricalmap.org/search?query=comancheria#map=7/34.746/-100.629&layers=O&date=1922-01-01&daterange=1922-01-01,2022-12-31
So then what is it about "San Marino 1463-present" that causes the search to return no results?
This is also true of Oregon:
This one has 4 results: https://www.openhistoricalmap.org/search?query=Oregon%20Territory
But this has only one: https://www.openhistoricalmap.org/search?query=Oregon
At least part of the issue seems to be that it is not doing substring search.
@batpad who do we know in the larger OSM community who we could reach out to on this one?
It seems clear that the data is in Nominatim's database, so not an issue of data syncing between main DB and Nominatim, but one can only get those results by typing in the exact name to get a result.
Looking at OSM for "oregon" and you get a lot more results: https://www.openstreetmap.org/search?query=oregon#map=13/41.6782/-83.4387
It appears to be doing substring search on, e.g., Oregon County, but I wonder if that's really true or if all these items have other searchable tags that are just "oregon".
E.g., here's "Oregon county" with an "alt name" of "Oregon": https://www.openstreetmap.org/relation/1180502
But this one seems like a substring: https://nominatim.openstreetmap.org/ui/details.html?osmtype=W&osmid=722118662&class=man_made
https://www.openstreetmap.org/way/722118662
The only visible instance of "Oregon" there is name:etymology:wikipedia | en:Hawthorne Boulevard (Portland, Oregon)
So the question is: how can we get our instance of Nominatim to treat name | San Marino 1463-present the same way OSM Nominatim treats name:etymology:wikipedia | en:Hawthorne Boulevard (Portland, Oregon)?
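To make the distinction in this comment concrete, here is a toy sketch of full-name matching versus partial (token) matching. It is only an illustration and not Nominatim's actual matching logic; the function names `full_name_match` and `partial_match` are made up for this example, and the names are taken from this thread.

```python
def normalize(text):
    """Lowercase the text and split it into whitespace-separated tokens."""
    return text.lower().split()


def full_name_match(query, name):
    """True only when the query equals the whole name, token for token."""
    return normalize(query) == normalize(name)


def partial_match(query, name):
    """True when every query token occurs somewhere in the name."""
    name_tokens = set(normalize(name))
    return all(token in name_tokens for token in normalize(query))


# Names mentioned in this thread, used purely as sample input.
names = ["San Marino 1463-present", "City of San Marino", "Oregon Territory"]

exact = [n for n in names if full_name_match("san marino", n)]
partial = [n for n in names if partial_match("san marino", n)]

print(exact)    # []
print(partial)  # ['San Marino 1463-present', 'City of San Marino']
```

Under full-name matching, "san marino" finds nothing because no name consists of exactly those two tokens, which mirrors the behaviour described above.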
I discussed this with @batpad and here are notes from that:
Potentially 3 things:
There's folks doing something similar to OSM-seed and they have a Pelias container packaged: https://github.com/headwaymaps/headway/tree/main/services/pelias - it seems like it might not be too much work to set it up and get it to read from our replication and just explore if that gives us better results / maybe more configurable, etc.
I'll check with @geohacker about an intro to Sarah Hoffmann. Since OSM is using Nominatim and getting better substring search results, this feels like something that might be solvable without actually getting into Nominatim internals. Though that won't tell us if Pelias might be a better long-term search solution...
@danrademacher I added another point in this ticket https://github.com/OpenHistoricalMap/issues/issues/243#issuecomment-1156246662 about admin levels used to construct the display name. I'm not entirely sure it's related but just wanted to link here as we look into this.
I'll try to tag @lonvia here to see if she might have any ideas. I mentioned to Sarah at SOTM that we are using Nominatim for OHM and have some quirks which we aren't entirely sure about.
To summarise, in this particular case:
- place=country node named San Marino https://www.openhistoricalmap.org/node/2090640309#map=20/43.93771/12.46485&layers=OND&date=1463-06-27&daterange=1463-06-27,2022-12-31
- place=city node named City of San Marino
Searching for "san marino" brings up no results.
I think we are most certainly misinterpreting Nominatim's behaviour and expecting something that our instance isn't configured properly for. @lonvia, would be great if you have any thoughts or directions you can point us in. Thank you!
That's indeed a problem of partial name matching, i.e. Nominatim having trouble matching san + marino with "San Marino 1463-present". The good news is that much of this is improved in the latest 4.1 version. So updating your installation and reimporting the Nominatim database might already solve that particular issue. However, there is a fundamental problem with your names here, which you should look into.
Nominatim has a very heavy bias towards matching against the full strings in the name tag. That can't be changed without ending up with a lot of false positive results. Names like 'San Marino 1463-present', which contain multiple pieces of information, are really bad. There are three ways to solve the problem:
1) Introduce the equivalent of start_date and end_date and move the extra information to separate tags. This is the cleanest solution but not always possible.
2) Establish the convention that the date information must go into brackets: San Marino (1463-present). Nominatim's tokenizer already has a special handling of names with brackets like that and will assume that 'San Marino' is a full name in this case. To be precise, it will add the full names 'San Marino' and 'San Marino 1463-present'. Maybe exactly what you need, maybe not.
3) Write your own tokenizer. That would be the really advanced version. The newest version of Nominatim allows you to preprocess the names before they are added into the search index, see this tutorial (part 'Write your own sanitizer'). So, whatever conventions you come up with for the name tag, you can write your own parser for that.
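As a rough sketch of the kind of name parsing that options 2 and 3 imply, the snippet below splits a name such as "San Marino 1463-present" or "San Marino (1463-present)" into a base name and a date range. It is plain Python and does not use Nominatim's sanitizer interface (see the tutorial mentioned above for that); the regular expression, the "present" keyword handling, and the function names `split_name` and `name_variants` are assumptions made for illustration only.

```python
import re

# Assumed convention: "<name> <start>-<end>" or "<name> (<start>-<end>)",
# where <start> is a year and <end> is a year or the word "present".
DATE_SUFFIX = re.compile(
    r"^(?P<base>.+?)\s+\(?(?P<start>\d{1,4})\s*-\s*(?P<end>\d{1,4}|present)\)?$"
)


def split_name(name):
    """Return (base_name, start, end); start and end are None if no dates found."""
    cleaned = name.strip()
    match = DATE_SUFFIX.match(cleaned)
    if match is None:
        return cleaned, None, None
    return match.group("base"), match.group("start"), match.group("end")


def name_variants(name):
    """List the full-name variants described for the bracket convention:
    the bare name plus the name with its date range."""
    base, start, end = split_name(name)
    if start is None:
        return [base]
    return [base, f"{base} {start}-{end}"]


print(split_name("San Marino 1463-present"))
print(name_variants("San Marino (1463-present)"))
print(split_name("Oregon Territory"))
```

A real sanitizer would apply something like `split_name` to each name before it reaches the search index, so that the bare name is indexed as a full name in its own right.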
The particular result should also be findable via the linked place name but I'd need the link to your Nominatim installation directly to check what is going on in the database. It's quite probable that this has been solved in the newest Nominatim version, too, with https://github.com/osm-search/Nominatim/pull/2637.
Thank you @lonvia for the quick and detailed response! This is great news -- so we can focus on (a) updating to latest Nominatim and (b) keeping dates out of names.
@jeffreyameyer for the dates in names, I think if we want those to appear in various places, we should try to add them from our already existing start_date and end_date tags programmatically, like appended to labels in map tiles at generation, or added to sidebar names via Rails or JavaScript code. That way the name stays what it is and dates get added where we want them consistently, without confusing the name field.
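A minimal sketch of that idea, in Python for brevity (the real code would live in the tile generation or the Rails/JavaScript front end). The helper names `year_of` and `label_with_dates` are hypothetical, and two assumptions are baked in: dates are CE values in YYYY or YYYY-MM-DD form, and a missing end_date is displayed as "present".

```python
def year_of(date):
    """Extract the year from a YYYY or YYYY-MM-DD string, dropping leading zeros.
    Assumes CE dates only."""
    year = date.split("-")[0]
    return year.lstrip("0") or "0"


def label_with_dates(tags):
    """Build a display label from name, start_date and end_date,
    leaving the name tag itself untouched."""
    name = tags.get("name", "")
    start = tags.get("start_date")
    end = tags.get("end_date")
    if not name or not start:
        # Nothing to append without a name and a start date.
        return name
    # Assumption: no end_date means the feature still exists.
    end_label = year_of(end) if end else "present"
    return f"{name} ({year_of(start)}-{end_label})"


print(label_with_dates({"name": "San Marino", "start_date": "1463-06-27"}))
# San Marino (1463-present)
```

Because the label is derived at display time, the name tag can stay a plain "San Marino" for search while the dates still show up wherever they are wanted.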
We'll still want to upgrade Nominatim to get best possible results, but this is a good outcome!
@danrademacher, I am going to work on updating Nominatim to the latest version! Let's see how that works.
Excellent!
I went ahead with the brackets/parentheses change as the first and easiest thing: https://www.openhistoricalmap.org/changeset/42589#map=13/43.9427/12.4594&layers=OND&date=0301-09-03&daterange=0301-09-03,2022-12-31
But in our current version of Nominatim at least, that didn't make any difference: https://www.openhistoricalmap.org/search?query=San%20Marino#map=17/43.95403/12.40774&layers=O&date=1922-01-01&daterange=1922-01-01,2022-12-31
Also note that we recently got a request to add dates to feature labels in iD, https://github.com/OpenHistoricalMap/issues/issues/430. Not sure how something similar could be done in JOSM. That would help a lot in terms of showing dates without messing with the name value.
The brackets should have worked even in the old Nominatim version. I see now that there is another problem with San Marino. It is a boundary on admin_level=2, aka country level. This is going to confuse Nominatim because countries are handled specially. And Nominatim makes the assumption that there is exactly one version of each country. You probably have more than one. At this point we should have a chat about how you expect the historical data to be handled by Nominatim when it is creating addresses. Depending on the answer we can see how to tweak your instance to handle the data right.
OK, thanks for the further look @lonvia. @jeffreyameyer we're going to need you to weigh in on this bigger issue of our expectations for Nominatim, especially around countries. It is definitely the case that most if not all countries will have many iterations of their borders over the centuries/millennia that could be covered in OHM. We might want to split this out into another ticket. I'm not quite sure, to be honest, how we clearly define our expectations of "how the historical data is to be handled by Nominatim when it is creating addresses", but Jeff and others have thought more deeply about such higher-level questions than I have!
The Nominatim container has been updated to 4.1, but it does not show the San Marino results yet.
OK, thanks @Rub21! Based on @lonvia's last note, it sounds like we have additional issues to work through here about how Nominatim treats admin-2/countries. I reassigned this ticket to @jeffreyameyer for follow up on our larger expectations for how Nominatim should be working
Further notes from team discussion
@lonvia - first of all, THANK YOU SO MUCH for showing up to help us with our issues. Search is a critical functionality and we've been wrestling with a few issues for a while. Feels good to get your thoughts!
... This is going to confuse Nominatim because countries are handled specially. And Nominatim makes the assumption that there is exactly one version of each country. You probably have more than one. At this point we should have a chat about how you expect the historical data to be handled by Nominatim when it is creating addresses. Depending on the answer we can see how to tweak your instance to handle the data right.
Your guesses are spot on - we definitely need more than 1 version of a country. The US alone probably has 50+ versions. Where can we go to find out more about how Nominatim handles countries? Also, how does it handle states? I'm now realizing that the funny behavior we've seen where Seattle or other places end up as locations in "The Republic of Texas" may be the result of some address reconstruction.
From #374:
- Introduce the equivalent of start_date and end_date and move the extra information to separate tags. This is the cleanest solution but not always possible.
Any object (typically relations) with years in the name=* tag should already have start_date and end_date populated.
- Establish the convention that the date information must go into brackets: San Marino (1463-present). Nominatim's tokenizer already has a special handling of names with brackets like that and will assume that 'San Marino' is a full name in this case. To be precise, it will add the full names 'San Marino' and 'San Marino 1463-present'. Maybe exactly what you need, maybe not.
This is totally reasonable. The dates are a critical piece of identifying / differentiating information that should always travel with the more common name string. I realize this sounds a bit odd and counter to data best practices, but we cannot guarantee what downstream forms the data will take. If there are some tools that parse only name and not the start and end date info, the resulting entries in that tool's table will be very difficult to differentiate.
In some ways, this is akin to our desire to put source=* tags on every object, as opposed to at the changeset level - the changeset source info is pretty inscrutable to the casual viewer looking at a single OSM object.
@lonvia - that all said, what's the best way to get in touch to coordinate a meeting, if you're open to that? Maybe a quick chat to start - 20-30 min?
Your guesses are spot on - we definitely need more than 1 version of a country. The US alone probably has 50+ versions. Where can we go to find out more about how Nominatim handles countries? Also, how does it handle states? I'm now realizing that the funny behavior we've seen where Seattle or other places end up as locations in "The Republic of Texas" may be the result of some address reconstruction.
Exactly. The problem is not only with country levels but with all address parts. The difference is that with country names it breaks in a way that you can't even search anymore. For other address parts the display output just becomes funny and you might get some odd false positive results.
@lonvia - that all said, what's the best way to get in touch to coordinate a meeting, if you're open to that? Maybe a quick chat to start - 20-30 min?
A chat would be the easiest way forward. Can you send me an email with some time suggestions? I'm in UTC+2.
This particular issue appears to be fixed with the upgrade (AWESOME!), although I think we should open another one for the tokenizer or whatever will help with substring searches (e.g. "oregon" doesn't return "oregon territory").
New ticket for this:
Bug description
This search for San Marino has no results: https://openhistoricalmap.org/search?query=san%20marino#map=7/44.153/-120.515&layers=O&date=1859&daterange=1859,2022
San Marino is quite well mapped: https://www.openhistoricalmap.org/#map=13/43.9399/12.4725&layers=O&date=1900&daterange=1800,2021
Here's a relation with "San Marino" in the name: https://www.openhistoricalmap.org/relation/2692735#map=13/43.9399/12.4510&layers=OND&date=301&daterange=301,2021
We would expect that to be found when searching for "san marino".
This works though:
https://www.openhistoricalmap.org/search?query=San%20Marino%201463-present#map=13/43.9426/12.4594&layers=OND&date=301&daterange=301,2021
This does not: https://www.openhistoricalmap.org/search?query=San%20Marino#map=13/43.9426/12.4594&layers=OND&date=301&daterange=301,2021