drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0

Query for unicode range \u036E-\u036F returns non-matching results #688

Open ZeLonewolf opened 1 year ago

ZeLonewolf commented 1 year ago

The following query for a range of two consecutive Unicode values returns 5,747 city nodes; however, none of the returned results appear to contain either character.

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"[\u036E-\u036F]"];
out;

Queries for each character individually return zero results:

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"\u036E"];
out;

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"\u036F"];
out;

mmd-osm commented 1 year ago

#332 is probably related...

Also note that \u needs a bit more escaping here: \\u

1ec5 commented 1 year ago

A query for node[place=city][name~"[\u1ebf]"] (with just one backslash) does return two cities that contain this combining character (because editors and imports at the time didn’t normalize the text to NFC). Expanding the range to U+0300 to U+036F correctly returns this node.

1ec5 commented 1 year ago

Oh, I just got lucky because the city names happened to contain some of the letters in the hexadecimal numbers in the range. Never mind me.
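
To illustrate the kind of accidental match described above, here is a minimal Python sketch. It assumes the `\uXXXX` escape reaches the regex engine undecoded and that the engine treats a backslash inside brackets as a literal character (as POSIX bracket expressions do). This is an assumption about the failure mode, not a description of Overpass internals; the pattern is written in Python's `re` syntax to emulate that reading, and `matches_literal_class` is a hypothetical helper name.

```python
import re

# Emulates a bracket expression in which the backslash is a literal member.
# Members are: \ u 0 3 6, the range E-\ (E..Z, '[' and '\'), then u 0 3 6 F.
# Neither U+036E nor U+036F is in the set.
literal_class = re.compile(r"[\\u036E-\\u036F]")


def matches_literal_class(name: str) -> bool:
    """Return True if the name contains any character from the literal set."""
    return literal_class.search(name) is not None


# Sample names are illustrative only, not query results.
for name in ["Houston", "Paris", "Cairo", "a\u036e"]:
    print(repr(name), matches_literal_class(name))
```

Under that reading, any name containing an uppercase letter from E to Z, or one of `u`, `0`, `3`, `6`, matches, while a name that really contains U+036E does not match on that character.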

mmd-osm commented 1 year ago

So based on U+1EBF, I'm getting the following three place=city nodes (with proper unicode regex support):

  <node id="369487050"/>
  <node id="369487099"/>
  <node id="3140507587"/>

ZeLonewolf commented 1 year ago

I note that even with the escaping fixed, I still get (different) nonsensical results:

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"[\\u036E-\\u036F]"];
out;

mmd-osm commented 1 year ago

Right, I noticed the missing backslash when revisiting #332. In the end it doesn't make much of a difference, since the underlying regular expression implementation doesn't handle ranges as expected.

I hope you received a link to a GitHub gist to try out another implementation that works a bit better.
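
One plausible way a range can go wrong, sketched in Python for illustration: if a regex engine matches UTF-8 text byte by byte, a range between two multi-byte characters turns into a range between individual bytes. Whether this is the exact mechanism in Overpass is an assumption; the thread only says ranges are not handled as expected. The helper names `char_level` and `byte_level` are hypothetical.

```python
import re

# The two combining characters from the original query.
lo, hi = "\u036e", "\u036f"
range_class = f"[{lo}-{hi}]"

# Character-aware matching: the class holds exactly two code points.
char_pattern = re.compile(range_class)

# Byte-oriented matching (assumed failure mode): the UTF-8 encoding of the
# class is [CD AE - CD AF], which a byte-wise engine reads as the byte set
# {0xCD, 0xAE..0xCD, 0xAF}. Lead bytes of many non-ASCII letters fall in it.
byte_pattern = re.compile(range_class.encode("utf-8"))


def char_level(name: str) -> bool:
    return char_pattern.search(name) is not None


def byte_level(name: str) -> bool:
    return byte_pattern.search(name.encode("utf-8")) is not None


# Sample names are illustrative only, not query results.
for name in ["Berlin", "München", "a\u036e"]:
    print(repr(name), char_level(name), byte_level(name))
```

In this sketch the byte-wise pattern matches "München" because the first byte of "ü" (0xC3) lies inside 0xAE..0xCD, even though the name contains neither combining character.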