elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Highlighting not working at field boundaries with whitespace or non-printing characters #36248

Closed tmo-trustpilot closed 5 years ago

tmo-trustpilot commented 5 years ago

Elasticsearch version Version: 6.3.2, Build: default/tar/053779d/2018-07-20T05:20:23.451332Z, JVM: 10.0.2

Plugins installed: ingest-geoip:6.3.2 ingest-user-agent:6.3.2

JVM version (java -version): Not sure, I'm using the docker image docker.elastic.co/elasticsearch/elasticsearch:6.3.2 to reproduce this but java isn't in the $PATH on that.

OS version (uname -a if on a Unix-like system): Linux 77b0d7cbec64 4.9.93-linuxkit-aufs #1 SMP Wed Jun 6 16:55:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Filtering for a search string and highlighting with a tag that is either a non-printing character (e.g. '\a') or a whitespace character ('\r' or ' ') drops the opening highlight tag from the result when the matching text is at the start of the field. I believe the closing tag is likewise dropped when the match is at the end of the field.

This leads to an unmatched closing tag in the search results. I expect that the starting tag of the highlighting should be included in the highlight result.
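A minimal sketch of how a client could detect the unmatched tag described above. It assumes the same string is used for both the opening and closing tag (as in the reproduction in this report) and that the delimiter never occurs in the field text itself; `has_unmatched_delimiter` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def has_unmatched_delimiter(fragment: str, delimiter: str) -> bool:
    """Return True when the delimiter occurs an odd number of times.

    Assumes the opening and closing tags are the same string and that
    the delimiter does not appear in the original field text.
    """
    return fragment.count(delimiter) % 2 == 1


# Fragments copied from the reproduction output in this report.
middle_match = "Elastic is \rgreat\r for search"
start_match = "Elastic\r is great for search"

print(has_unmatched_delimiter(middle_match, "\r"))
print(has_unmatched_delimiter(start_match, "\r"))
```

The first fragment contains a balanced pair of tags; the second has lost its opening tag, so the check flags it.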

In the example below I have number_of_fragments set to 0, but it also occurs with fragment_size set instead. In our use case we're using '\a' from the Python client as our delimiter, which has the same effect. I can't work out how to escape that properly in curl for reproducing, but the same thing happens with '\r'.

Steps to reproduce:

This script reproduces the issue. Note that it will delete an index called sample_index if you run it. It shows:

  1. Searching for the middle of the text with "\r" as tags (which works)
  2. Searching for the start of the text with "$" as tags (which works)
  3. Searching for the start of the text with "\r" as tags (which doesn't work)
#!/bin/sh

echo Deleting any existing index
curl -s -X DELETE localhost:9200/sample_index
echo

echo Creating new index
curl -s -X PUT localhost:9200/sample_index -H 'Content-Type: application/json' -d' 
{ 
  "mappings": {
    "sample": {
      "properties": {
          "SampleField": { "type": "text" }
      }
    }
  }
}
'
echo

echo Inserting new document
curl -s -X POST 'localhost:9200/sample_index/sample?refresh' -H 'Content-Type: application/json' --data-binary \
  '{ "SampleField": "Elastic is great for search" }'
echo

echo Searching for text in middle
curl -s -X GET localhost:9200/_search -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [{
        "query_string": { "fields": [ "SampleField" ], "query": "great" }
      }]
    }
  },
  "highlight": {
    "pre_tags": [ "\r" ],
    "post_tags": [ "\r" ],
    "fields": { "SampleField": {} },
    "number_of_fragments": 0
  }
}
' | python -m json.tool
# hits.hits.highlight.SampleField: "Elastic is \rgreat\r for search"

echo Searching for text at start with plain text delimiters
curl -s -X GET localhost:9200/_search -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [{
        "query_string": { "fields": [ "SampleField" ], "query": "Elastic" }
      }]
    }
  },
  "highlight": {
    "pre_tags": [ "$" ],
    "post_tags": [ "$" ],
    "fields": { "SampleField": {} },
    "number_of_fragments": 0
  }
}
' | python -m json.tool
# hits.hits.highlight.SampleField: "$Elastic$ is great for search"

echo Searching for text at start with carriage-return delimiters
curl -s -X GET localhost:9200/_search -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [{
        "query_string": { "fields": [ "SampleField" ], "query": "Elastic" }
      }]
    }
  },
  "highlight": {
    "pre_tags": [ "\r" ],
    "post_tags": [ "\r" ],
    "fields": { "SampleField": {} },
    "number_of_fragments": 0
  }
}
' | python -m json.tool
# hits.hits.highlight.SampleField: "Elastic\r is great for search"

Provide logs (if relevant): Nothing interesting shows up

elasticmachine commented 5 years ago

Pinging @elastic/es-search

romseygeek commented 5 years ago

Hi @tmo-trustpilot, thanks for opening an issue. The Elasticsearch PassageFormatter trims whitespace from the edges of snippets, to prevent results like " is <b>great</b> for search", with a leading space, from being returned. We should probably reject pre-tag and post-tag values that would get removed by this. As a workaround, you can use different tags and then replace them in the client if you need to use \r.
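The workaround could look roughly like the following on the client side. This is only a sketch: the placeholder strings and the helper names `highlight_settings` and `restore_delimiters` are made up for illustration, and the sample fragment is a simulated response rather than captured output. The placeholders only need to be strings that survive trimming and do not occur in the indexed text.

```python
# Hypothetical placeholder tags: printable, so edge trimming leaves them alone.
PLACEHOLDER_PRE = "@@HL_START@@"
PLACEHOLDER_POST = "@@HL_END@@"


def highlight_settings(field: str) -> dict:
    """Build the "highlight" part of a search body using the placeholder tags."""
    return {
        "pre_tags": [PLACEHOLDER_PRE],
        "post_tags": [PLACEHOLDER_POST],
        "fields": {field: {}},
        "number_of_fragments": 0,
    }


def restore_delimiters(fragments: list, delimiter: str = "\r") -> list:
    """Swap the placeholder tags for the delimiter the application really wants."""
    return [
        fragment.replace(PLACEHOLDER_PRE, delimiter).replace(PLACEHOLDER_POST, delimiter)
        for fragment in fragments
    ]


# Simulated highlight fragments, shaped like hits.hits.highlight.SampleField.
simulated = ["@@HL_START@@Elastic@@HL_END@@ is great for search"]
print(repr(restore_delimiters(simulated)[0]))
```

Because the substitution happens after the response is received, the delimiter can be any character, including '\a'.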

tmo-trustpilot commented 5 years ago

Thanks, yeah, that makes sense. I wouldn't expect that to apply when number_of_fragments is set to zero, which returns the whole field, though, or in the case of non-printing characters like '\x07'.
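One possible explanation for why a non-printing character such as '\x07' is affected as well: Java's String.trim() strips every leading and trailing character with a code point at or below U+0020, not only spaces. The sketch below simulates that rule in Python. Whether the formatter trims in exactly this way is an assumption, not something confirmed in this thread, and `java_style_trim` is a hypothetical helper name.

```python
def java_style_trim(text: str) -> str:
    """Strip leading and trailing characters with code point <= U+0020.

    This mirrors the documented behaviour of Java's String.trim(); it is
    an assumption that the highlighter trims snippets the same way.
    """
    start, end = 0, len(text)
    while start < end and ord(text[start]) <= 0x20:
        start += 1
    while end > start and ord(text[end - 1]) <= 0x20:
        end -= 1
    return text[start:end]


# A tag at the very start of the snippet is stripped; tags in the middle are not.
print(repr(java_style_trim("\rElastic\r is great for search")))
print(repr(java_style_trim("\x07Elastic\x07 is great for search")))
print(repr(java_style_trim("Elastic is \rgreat\r for search")))
print(repr(java_style_trim("$Elastic$ is great for search")))
```

Under this assumption the '\r' case matches the output reported above ("Elastic\r is great for search"), and '\x07' would behave the same way since it is also below U+0020.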

We will use another replacement string in the meantime.

mayya-sharipova commented 5 years ago

Closing this issue, as there is a workaround for it.