elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Regexp parsing failure for empty OR inside a group #66159

Open v-echo opened 3 years ago

v-echo commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.6.2

JVM version (java -version): Embedded

OS version (uname -a if on a Unix-like system): Windows Server 2016

Description of the problem including expected versus actual behavior:

When executing the following query:

"regexp":{
     "content":{
          "value":"[0-9](\\/|\\:| |)[^aboiyzABOIYZ0-9\\[-\\` -@](\\/|\\:| |)[0-9]{2,}"
     }
}

The following error is returned: err0

When tested on https://regex101.com/, the same expression compiles and matches correctly. The docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html do not make clear what the problem is. On further testing, however, the expression parses if the last | inside each group is removed or escaped, though it then no longer matches correctly, since the meaning of the symbol changes.

The 'content' field mapping is simple, almost default: mapping

For reference and testing: the expression is meant to match UTM coordinates. The test cases are taken from here:

utmMatches = ['4Q6109372363778', '1f21', '2e01928391087509127405123521353526798', 
                '4:Q6109372363778', '4/Q/6109372363778', '4 Q 6109372363778', '1 e 231']
utmFails = ['4a6109372363778', 'asljkd', '123f', '1a21', '1234', '1/2:123', '1:./21', '1 a 1234']
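
The observation that the pattern parses once the trailing | is removed suggests that the empty alternative is what the parser rejects. The sketch below checks the intended semantics with Python's re module, which is not Lucene's regexp engine, so it says nothing about how Elasticsearch parses either form. It compares the original pattern with a hypothetical rewrite in which each `(\/|\:| |)` group becomes the optional group `(\/|\:| )?`. Whether Elasticsearch accepts the rewritten form is an assumption that has not been verified here.

```python
import re

# Pattern from the query after JSON unescaping (each \\ becomes \).
ORIGINAL = r"[0-9](\/|\:| |)[^aboiyzABOIYZ0-9\[-\` -@](\/|\:| |)[0-9]{2,}"

# Hypothetical rewrite: the empty alternative is replaced by an optional group.
REWRITTEN = r"[0-9](\/|\:| )?[^aboiyzABOIYZ0-9\[-\` -@](\/|\:| )?[0-9]{2,}"

utm_matches = ['4Q6109372363778', '1f21', '2e01928391087509127405123521353526798',
               '4:Q6109372363778', '4/Q/6109372363778', '4 Q 6109372363778', '1 e 231']
utm_fails = ['4a6109372363778', 'asljkd', '123f', '1a21', '1234', '1/2:123',
             '1:./21', '1 a 1234']


def matches(pattern, text):
    # fullmatch is used because Lucene regexp patterns are anchored to the whole term.
    return re.fullmatch(pattern, text) is not None


def check(pattern):
    # True only if every expected match is accepted and every expected failure is rejected.
    return (all(matches(pattern, s) for s in utm_matches)
            and not any(matches(pattern, s) for s in utm_fails))


if __name__ == "__main__":
    print("original:", check(ORIGINAL))
    print("rewritten:", check(REWRITTEN))
```

If both checks agree in Python, the rewritten pattern could be tried in the regexp query as a possible workaround. This does not explain or fix the parsing failure itself.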
danielmitterdorfer commented 3 years ago

Here is a reproduction scenario:

# (1) create an index with a text property
curl -X PUT "localhost:9200/es-66159?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "content": { "type": "text" }
    }
  }
}
'

# (2) index a document

# store as doc.json
{
  "content": "['4Q6109372363778', '1f21', '2e01928391087509127405123521353526798', '4:Q6109372363778', '4/Q/6109372363778', '4 Q 6109372363778', '1 e 231']"
}

curl -X POST "localhost:9200/es-66159/_doc?pretty" -H 'Content-Type: application/json' --data-binary "@doc.json"

# (3) issue regexp query
curl -X GET "localhost:9200/es-66159/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "regexp": {
      "content": {
        "value":"[0-9](\\/|\\:| |)[^aboiyzABOIYZ0-9\\[-\\` -@](\\/|\\:| |)[0-9]{2,}"
      }
    }
  }
}
'
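
The doubled backslashes in step (3) are JSON escaping. The regular expression the server receives has single backslashes. As a minimal sketch, the same request body can be generated in Python with json.dumps so that the escaping is produced rather than typed by hand. The helper name build_regexp_query is hypothetical, and sending the body to the server is deliberately left out.

```python
import json

# The regular expression as the regexp query should see it (single backslashes).
PATTERN = r"[0-9](\/|\:| |)[^aboiyzABOIYZ0-9\[-\` -@](\/|\:| |)[0-9]{2,}"


def build_regexp_query(field, pattern):
    """Return the JSON request body for a regexp query on a single field."""
    return json.dumps({"query": {"regexp": {field: {"value": pattern}}}})


if __name__ == "__main__":
    # The printed body can be passed to curl with -d, as in step (3).
    print(build_regexp_query("content", PATTERN))
```

Decoding the generated body with json.loads returns the original pattern, which confirms that the hand-written `\\/` and `\\:` in the curl command are escaping and not part of the expression.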
elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)