drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0
717 stars 90 forks source link

Invalid query results with regexp filter conditions #192

Closed mmd-osm closed 5 years ago

mmd-osm commented 9 years ago

According to the Wiki:

"The result set is the set of all elements that match the conditions of all the filters. "

Let's take a look at the following examples:

Statement Way 271221475 is ... by out geom;
way(271221475); shown
way(271221475) ["addr:city"]; shown
way(271221475) ["addr:city"] ["addr:hamlet"!~"."]; not shown
way(271221475) ["addr:city"] ["addr:hamlet"!~"."] [~"^addr:hamlet:.*$"~"."]; shown

In example 3, the condition ["addr:hamlet"!~"."] is not met. To my surprise, adding another key-regexp filter in example 4 overrides this. Is this a bug, undocumented behavior or did I miss something here?

Reference data:

  <way id="271221475">
    <bounds minlat="46.6877029" minlon="11.8496157" maxlat="46.6878615" maxlon="11.8498448"/>
    <nd ref="2761802524" lat="46.6877984" lon="11.8496157"/>
    <nd ref="2761802876" lat="46.6877710" lon="11.8496460"/>
    <nd ref="2761802583" lat="46.6877600" lon="11.8496249"/>
    <nd ref="2761802032" lat="46.6877178" lon="11.8496716"/>
    <nd ref="2761802205" lat="46.6877310" lon="11.8496969"/>
    <nd ref="2761802372" lat="46.6877029" lon="11.8497280"/>
    <nd ref="2761802069" lat="46.6877638" lon="11.8498448"/>
    <nd ref="2761802026" lat="46.6878615" lon="11.8497366"/>
    <nd ref="2761802524" lat="46.6877984" lon="11.8496157"/>
    <tag k="addr:city" v="San Martin de Tor - St.Martin in Thurn - S.Martino in Badia"/>
    <tag k="addr:city:de" v="St.Martin in Thurn"/>
    <tag k="addr:city:it" v="S.Martino in Badia"/>
    <tag k="addr:city:lld" v="San Martin de Tor"/>
    <tag k="addr:hamlet" v="Antermoia"/>
    <tag k="addr:hamlet:de" v="Untermoj"/>
    <tag k="addr:hamlet:it" v="Antermoia"/>
    <tag k="addr:hamlet:lld" v="Antermoia"/>
    <tag k="addr:housenumber" v="40"/>
    <tag k="addr:postcode" v="39030"/>
    <tag k="addr:street" v="Strada Plan Murin"/>
    <tag k="addr:street:de" v="Straße Plan Murin"/>
    <tag k="addr:street:it" v="Strada Plan Murin"/>
    <tag k="addr:street:lld" v="Strada Plan Murin"/>
    <tag k="building" v="yes"/>
  </way>
drolbr commented 9 years ago

Thank you for finding the issue. That's a bug.

mmd-osm commented 9 years ago

Test query: node(1530957491)["name"!~"."][~"^name:.*$"~"."];out;

Bad behavior is caused by coding in void filter_ids_by_ntags in query.cc in Line 608:

  sort(removed_ids.begin(), removed_ids.end());
  removed_ids.erase(unique(removed_ids.begin(), removed_ids.end()), removed_ids.end());
  new_ids.erase(set_difference(new_ids.begin(), new_ids.end(),
                removed_ids.begin(), removed_ids.end(), new_ids.begin()), new_ids.end());

Situation before new_ids.erase: new_ids: (6 entries): 1530957491 1530957491 1530957491 1530957491 1530957491 1530957491

Situation after new_ids.erase: new_ids: (5 entries): 1530957491 1530957491 1530957491 1530957491 1530957491

Although the object ids are all identical, the erase statement fails to delete all relevant ones - it just deletes a single entry. Effectively, way 1530957491 still survives in the result, although it has been identified to be removed by the negative key/value_regex.

The following pull request fixes this by first removing duplicate entries in new_ids:

 sort(new_ids.begin(), new_ids.end());
 new_ids.erase(unique(new_ids.begin(), new_ids.end()), new_ids.end());

This fix is also deployed for testing on the dev instance: http://overpass-turbo.eu/s/b03

mmd-osm commented 8 years ago

Issue was fixed in https://github.com/drolbr/Overpass-API/commit/393581016b2b0ffa244bd84debfc02fe69c33386

mmd-osm commented 8 years ago

I'm reopening this issue, as the patch in https://github.com/drolbr/Overpass-API/commit/393581016b2b0ffa244bd84debfc02fe69c33386 causes performance regressions. The following very simple query used to take 50ms, and now takes 500ms.

Test case:

[out:json];node["name"~"[33][22][22][55]"]["phone"](3.2299951,-76.6815258,3.6299951000000004,-76.2815258);out body;

With an additional instrumentation in method generate_ids_by_coarse, we can check the index used for _ids_bycoarse, as well as the vector size before / after removing duplicates. Results are shown below.

It's pretty evident, that we're sorting and removing the same elements (plus a few new ones) over and over again for the same coarse index.

I'm proposing to change this in a way that sorting / removing duplicates is only done once per it->first.val() & 0x7fffff00. See referenced pull request.

ids_by_coarse - index: 461641984
Before de-dup: 38
After de-dup: 38
ids_by_coarse - index: 461641984
Before de-dup: 63
After de-dup: 63
ids_by_coarse - index: 461641984
Before de-dup: 100
After de-dup: 100
ids_by_coarse - index: 461641984
Before de-dup: 103
After de-dup: 103
ids_by_coarse - index: 461641984
Before de-dup: 155
After de-dup: 155
ids_by_coarse - index: 461641984
Before de-dup: 163
After de-dup: 163
ids_by_coarse - index: 461641984
Before de-dup: 202
After de-dup: 202
ids_by_coarse - index: 461641984
Before de-dup: 250
After de-dup: 250
ids_by_coarse - index: 461641984
Before de-dup: 281
After de-dup: 281
ids_by_coarse - index: 461641984
Before de-dup: 319
After de-dup: 319
ids_by_coarse - index: 461641984
Before de-dup: 450
After de-dup: 450
ids_by_coarse - index: 461641984
Before de-dup: 458
After de-dup: 458
ids_by_coarse - index: 461641984
Before de-dup: 531
After de-dup: 531
ids_by_coarse - index: 461641984
Before de-dup: 554
After de-dup: 554
ids_by_coarse - index: 461641984
Before de-dup: 604
After de-dup: 604
ids_by_coarse - index: 461641984
Before de-dup: 629
After de-dup: 629
ids_by_coarse - index: 461641984
Before de-dup: 674
After de-dup: 674
ids_by_coarse - index: 461641984
Before de-dup: 744
After de-dup: 744
ids_by_coarse - index: 461641984
Before de-dup: 781
After de-dup: 781
ids_by_coarse - index: 461641984
Before de-dup: 866
After de-dup: 866
ids_by_coarse - index: 461641984
Before de-dup: 890
After de-dup: 890
ids_by_coarse - index: 461641984
Before de-dup: 934
After de-dup: 934
ids_by_coarse - index: 461641984
Before de-dup: 1005
After de-dup: 1005
ids_by_coarse - index: 461641984
Before de-dup: 1027
After de-dup: 1027
ids_by_coarse - index: 461641984
Before de-dup: 1129
After de-dup: 1129
ids_by_coarse - index: 461641984
Before de-dup: 1266
After de-dup: 1266
ids_by_coarse - index: 461641984
Before de-dup: 1391
After de-dup: 1391
ids_by_coarse - index: 461641984
Before de-dup: 1494
After de-dup: 1494
ids_by_coarse - index: 461641984
Before de-dup: 1586
After de-dup: 1586
ids_by_coarse - index: 461641984
Before de-dup: 1698
After de-dup: 1698
ids_by_coarse - index: 461641984
Before de-dup: 1713
After de-dup: 1713
ids_by_coarse - index: 461641984
Before de-dup: 1759
After de-dup: 1759
ids_by_coarse - index: 461641984
Before de-dup: 1799
After de-dup: 1799
ids_by_coarse - index: 461641984
Before de-dup: 1835
After de-dup: 1835
ids_by_coarse - index: 461641984
Before de-dup: 1904
After de-dup: 1904
ids_by_coarse - index: 461641984
Before de-dup: 1909
After de-dup: 1909
ids_by_coarse - index: 461641984
Before de-dup: 1950
After de-dup: 1950
ids_by_coarse - index: 461641984
Before de-dup: 1968
After de-dup: 1968
ids_by_coarse - index: 461641984
Before de-dup: 1972
After de-dup: 1972
ids_by_coarse - index: 461641984
Before de-dup: 2055
After de-dup: 2055
ids_by_coarse - index: 461641984
Before de-dup: 2078
After de-dup: 2078
ids_by_coarse - index: 461641984
Before de-dup: 2087
After de-dup: 2087
ids_by_coarse - index: 461641984
Before de-dup: 2094
After de-dup: 2094
ids_by_coarse - index: 461641984
Before de-dup: 2098
After de-dup: 2098
ids_by_coarse - index: 461641984
Before de-dup: 2108
After de-dup: 2108
ids_by_coarse - index: 461641984
Before de-dup: 2109
After de-dup: 2109
ids_by_coarse - index: 461641984
Before de-dup: 2110
After de-dup: 2110
ids_by_coarse - index: 461641984
Before de-dup: 2116
After de-dup: 2116
ids_by_coarse - index: 461641984
Before de-dup: 2122
After de-dup: 2122
ids_by_coarse - index: 461641984
Before de-dup: 2272
After de-dup: 2272
ids_by_coarse - index: 461641984
Before de-dup: 2320
After de-dup: 2320
ids_by_coarse - index: 461641984
Before de-dup: 2358
After de-dup: 2358
ids_by_coarse - index: 461641984
Before de-dup: 2390
After de-dup: 2390
ids_by_coarse - index: 461641984
Before de-dup: 2448
After de-dup: 2448
ids_by_coarse - index: 461641984
Before de-dup: 2499
After de-dup: 2499
ids_by_coarse - index: 461641984
Before de-dup: 2542
After de-dup: 2542
ids_by_coarse - index: 461641984
Before de-dup: 2556
After de-dup: 2556
ids_by_coarse - index: 461641984
Before de-dup: 2565
After de-dup: 2565
ids_by_coarse - index: 461641984
Before de-dup: 2571
After de-dup: 2571
ids_by_coarse - index: 461641984
Before de-dup: 2608
After de-dup: 2608
ids_by_coarse - index: 461641984
Before de-dup: 2615
After de-dup: 2615
ids_by_coarse - index: 461641984
Before de-dup: 2623
After de-dup: 2623
ids_by_coarse - index: 461641984
Before de-dup: 2626
After de-dup: 2626
ids_by_coarse - index: 461641984
Before de-dup: 2640
After de-dup: 2640
ids_by_coarse - index: 461641984
Before de-dup: 2673
After de-dup: 2673
ids_by_coarse - index: 461641984
Before de-dup: 2690
After de-dup: 2690
ids_by_coarse - index: 461641984
Before de-dup: 2705
After de-dup: 2705
ids_by_coarse - index: 461641984
Before de-dup: 2712
After de-dup: 2712
ids_by_coarse - index: 461641984
Before de-dup: 2743
After de-dup: 2743
ids_by_coarse - index: 461641984
Before de-dup: 2800
After de-dup: 2800
ids_by_coarse - index: 461641984
Before de-dup: 2812
After de-dup: 2812
ids_by_coarse - index: 461641984
Before de-dup: 2817
After de-dup: 2817
ids_by_coarse - index: 461641984
Before de-dup: 2821
After de-dup: 2821
ids_by_coarse - index: 461641984
Before de-dup: 2888
After de-dup: 2888
ids_by_coarse - index: 461641984
Before de-dup: 3085
After de-dup: 3085
ids_by_coarse - index: 461641984
Before de-dup: 3093
After de-dup: 3093
ids_by_coarse - index: 461641984
Before de-dup: 3140
After de-dup: 3140
ids_by_coarse - index: 461642240
Before de-dup: 4
After de-dup: 4
drolbr commented 8 years ago

It is now an enhancement because it is a merely question of performance.

mmd-osm commented 8 years ago

Closing, as issue was fixed in c30c154df701d20cf1ec3e9274910d9dce31ce7d..e6394bff10c7ea8a9c68c876854b4b13feb85a3d

Thanks!

mmd-osm commented 7 years ago

I'm reopening this issue, as there's a strange effect with the Karlsruhe node.

I'm testing with keyregexp and a Cyrillic character 'л':

node(49.01342968289611,8.403076529502867,49.015037577534585,8.406440019607544)[!name][~"."~"л"];
out;

/*
Test:
  nodes: 21487097  -> ok        (  [!name] removes node  )
         240120582 -> not ok    (  [!name] has no impact )
*/   

http://overpass-turbo.eu/s/pw1

My expectation is that node 240120582 (Karlsruhe) is not shown in the result, because is has a name tag. This works pretty well for another nearby node 21487097 (Mühlburg), which also has a name tag and the same Cyrillic character, i.e. the Regexp itself filters the correct nodes.

This one is even worse, returning lots of nodes with a name tag:

node({{bbox}})[!name][~"."~"a"];
node._[name];
out;

http://overpass-turbo.eu/s/pwz

Also tested on //dev.overpass-api.de/api_new_feat/ with the same effect.

ghost commented 7 years ago

I would like to obtain all nodes in a region that have no name tag but do have name:lang. The following query returns nodes that do have a name tag, even though it shouldn't:

node[!"name"][~"^name:(.*)$"~".*"];
mmd-osm commented 5 years ago

This query returns invalid results depending on zoom level:

http://overpass-turbo.eu/s/Dww

[bbox:{{bbox}}];
node[~"^fhrs.*"~".*"][!"fhrs:id"];
out center;
mmd-osm commented 5 years ago

Always the same reason: filter_ids_by_ntags is getting called with duplicates ids in std::vector< Id_Type >& new_ids

mmd-osm commented 5 years ago

This issue was fixed in https://github.com/drolbr/Overpass-API/commit/2ca3f3877bf64a5b6ceaf91423721b37d61aec3b