Framstag / libosmscout

Libosmscout is a C++ library for offline map rendering, routing and location lookup based on OpenStreetMap data
Other
263 stars 81 forks source link

case study: several streets in villages are not assigned to the corresponding village #191

Closed rinigus closed 7 years ago

rinigus commented 7 years ago

When studying import of Estonia closer while working on geocoder, I observed that several villages are stripped from their streets and streets are assigned to the corresponding county. Let's take as an example

http://www.openstreetmap.org/search?query=haava%20tee%204%2C%20Kaberneeme%20k%C3%BCla#map=19/59.51832/25.27578

In libosmscout import, the street "Haava tee" is under "Jõelähtme vald" (as observed in location_full.txt and in generated SQLite database). As a result, we cannot search for "Haava tee, Kaberneeme küla".

I haven't looked into why the village is stripped of the streets. Note that this is not isolated case and are quite a few other villages like that.

Maybe you could give some tips or know the reason? Map was imported using estonia-latest.osm.pbf from http://download.geofabrik.de/europe/estonia.html

Karry commented 7 years ago

Hi. Just a quick question. Is village defined as administrative region in OSM data, or it is just a single point?

Framstag commented 7 years ago

Assigning an address and a location to a region only works, if there is an area-like object found with the region name. An area-like object can either be a administrative boundary or a place type (by marking it as ADMIN_REGION):

In case of villages: TYPE place_village = NODE AREA ("place"=="village") {Name, NameAlt} ADMIN_REGION

In this case https://www.openstreetmap.org/relation/352071 should be the relation that defines the administrative boundary for Kaberneeme küla.

The location index generation will build a tree from the administrative boundaries and places and then will look the "deepest region in the tree" for the given geo coord location.

I would have to make a closer look why in this case this did not work. It is possible that the two outer region the boundary at least obviously has, result in a problem. The first task is, to see, if the region was correctly detected and inserted into the tree. The next step would be, to see why the address was not sorted into this region.

Framstag commented 7 years ago

[tim:~/projects/OSMScout/Demos] master(+51/-0,247)+* 127 ± src/LocationLookup ../maps/estonia "Kaberneeme küla" = Kaberneeme küla

Looks correct to me (though there might still be the two outer problem)

rinigus commented 7 years ago

Its defined as a region. Otherwise it would be unserstandable. Notice that OSM search is able to resolve and find the address correctly

The village seems to be assigned to the area, but the streets of the village are not assigned to the village, for some reason...

Framstag commented 7 years ago

I still assume that the multiple outer areas might be the problem. I'll analyse further, as soon as I have some time.

rinigus commented 7 years ago

Thank you!

rinigus commented 7 years ago

When looking closer, it looks that Jõelahtme vald is missing for the village (compare with OSM link result). So, maybe these borders do make the library confused and the village is pushed into a level above (ssame as vald).

Framstag commented 7 years ago

It looks like the region tree is already "broken" in this special case (and most likely also in mor egeneral if a region has multiple areas).

If have not yet done actual debugging but looking at the region tree the hierachy is:

Kaberneeme küla/Harju maakond/Eesti

Reason for this is likely, that for the both areas of Kaberneeme küla one common boundingBox is calculated. And it looks like this boundingBox is more essential for building the tree. And since the common boundingbox of the sum of both regions is big, it might be that is sorted in higher in the hierachy than it should. Everything else is likely just a consequence...

Easy solution would be to insert the region once for each area (instead of one region with a list of areas). But I'm unsure what the consequences would be. Need some (continuous) time to think about it and test it (sound like "weekend"). However t should be easy to try out, if you are interested....

Framstag commented 7 years ago

It should be easy to see what actually happens by dumping some debug output (region name and comparison result) into

LocationIndexGenerator::AddRegion()
Framstag commented 7 years ago

After thinking about it, IMHO assigning a region with a different area multiple times is wrong. From the mapping view assigning a location/address based on containment is wrong. because of corner cases a location/address should always be assigned to the region explicitly named, even if there is no containment. I assume we will fix this in future. If we now have regions multiple times, we will never be sure which of the region we have to assign a location/address.

I will now refactor the code so that every area has its own bounding box instead of the region itself. I hope, this will work out and will work better.

rinigus commented 7 years ago

I must say, you lost me here. Let me try to ask the questions that could make it clearer. I don't see where do we "assign a region with a different area multiple times". Would you mind to bring out an example?

As for explicitly named regions: do you mean use of is_in tag in OSM? As far as I could see, its just a name that does not even have to match anything (not some OSM object/relation ID).

What's also confusing for me is that you relate to AREA and REGION as a different entities. I guess, I always assumed they are the same.

Framstag commented 7 years ago

We currently do not. Currently a region can consist of multiple areas. All areas are assigned to one region object and this region object has one parent.

To fix the problem I was thinking about, changing this to having one area per region but instantiate the region multiple times with (possibly) different parents.

Yes, I mean the is_in tag. Yes, I know that in the past it was of dubious quality but IMHO it should always be checked first. There are corner cases, where an address is not in a given area but still belong to that region. This especially true for object along the border of regions. The placement of the entry can result in such things like "belongs to region B but because the entry leaves to street "a" and street "a" belong to region A it is assigned to region A". We currently do not check "in_in" at all.

A administrative boundary is normally a area or relation. A relation can have multiple outer rings. This is the case where the city has not one area but somehow "island". Such relation could also have inner rings, with then are cutouts and mark sub areas that do not belong to the region. We currently do not handle such cutouts either.

rinigus commented 7 years ago

OK, I see. Thank you for explanation, it helped together with looking into importer. Maybe there is no need to do massive changes. As it is now - regions as collection of areas - we have correspondence between admin division and its representation (as you brought out the islands...). So, maybe this should stay and we would have to make algorithm of checking a bit more robust, i.e tolerate floating point imperfections as well as the data imperfections.

As a start, using is_in would be great. That you could immediately relate to your Region class and let's see if that would help a lot already. I was going to suggest is_in as well and maybe it should be considered as a first check with the current checks (and their improvement) done only if is_in fails.

Framstag commented 7 years ago

I did some changes, that at least change the result. Can you please take a look if things are better now?

rinigus commented 7 years ago

Thank you! Sure, I'll start testing right now

rinigus commented 7 years ago

Unfortunately, the case is still not resolved: Kaberneeme küla is still not associated with Jõelähtme vald, but is on an administrative level above. Maybe the problem is in a small island visible at http://www.openstreetmap.org/relation/352071 ? It still belongs to Jõelähtme vald http://www.openstreetmap.org/relation/350208#map=13/59.5117/25.3149 , but it is a bit out of the way...

rinigus commented 7 years ago

BTW, is_in is correctly pointing to the admin relationships, as expected.

Framstag commented 7 years ago

I will take a further look but fear that this can only be solved by evaluating is_in and also addr:city on the address.

Framstag commented 7 years ago

I think it makes sense, to split this into two futher issues (is_in, and addr:city). We can keep this issue open nevertheless.

rinigus commented 7 years ago

Meanwhile, I'll try to catch where is the split failing. That way maybe we can approach the problem from different angles.

rinigus commented 7 years ago

I traced the problem to https://github.com/Framstag/libosmscout/blob/master/libosmscout-import/src/osmscout/import/GenLocationIndex.cpp#L472 (call IsAreaSubOfAreaQuorum), as should have been expected. I'll work on it further and see if a different implementation of area check would work better.

Tim, can you tell whether each area represented as vectors is expected to be a proper closed area? As you mentioned earlier, we don't even cover yet regions with holes in them (like hole corresponding to another region), right? So, we could assume that the region does not have holes in the check.

Framstag commented 7 years ago

For region that are represented by one area a child region should be 100% enclosed by its parent region. IT would be strange if a suburb is somehow bigger than the enclosing city. Sadly this precision is not always in the current OSM data. I had examples where there was slight overlapping. Thus the Quorum method was introduced, that allows some variance.

Regarding "being closed". Yes, the list of nodes that represent the area should be closed it that the last node and the first node close the given area. It is an area, not a way. It should also be normally shaped so e.g. the border should not intersect itself. However this is OSM, mapping error are always possible. So I would at lest expect that e.g. hourglass shapes could happen (instead of mapping this as two disjunct areas).

The problem we have likely occurs if a region consists of multiple areas. In this cases situation is not clear anymore. And the right containment may even depend on the order of objects during tree building.

Holes are not that much of an problem if for the hole there (likely) also exists a different boundary that defines this sub region matching the hole. Since we always assign the node to the leaf of the tree we would still assign the address to this sub region.

My pushes changes do mainly the following:

IMHO these are all algorithmic fixes that will improve the situation. Major improvements will however likely come only by using containment only as fall back and using addr:city and is_in instead for a first try. This is not difficult in itself. But some amount of code has the get written and tested and integration into the existing code, that is already rather complex. Perhaps doing some refactoring of the existing code (more OO mechnism, better naming, overhault of data structures) should possibly done first.

Hope this helps.

rinigus commented 7 years ago

I just submitted PR fixing this case.

One general question: should we replace all coordinate comparisons with some inequality allowing the values to differ within some tolerance? I think that could make a difference in some cases.

rinigus commented 7 years ago

I guess we can close this issue after merging https://github.com/Framstag/libosmscout/pull/237