elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Highlighting Error with `span_field_masking` Requires Indexing Offsets Unexpectedly #101804

Open ahoogol opened 11 months ago

ahoogol commented 11 months ago

Elasticsearch Version

8.10.4

Installed Plugins

No response

Java Version

bundled

OS Version

Elastic Cloud - GCP - Iowa (us-central1)

Problem Description

I encountered an issue when using the span_field_masking feature in Elasticsearch. When attempting to use the highlighter with this feature, the following error is thrown:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "field 'text' was indexed without offsets, cannot highlight"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_mask",
        "node": "jUZ9p0ZtR6-xYevegW6O_Q",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "field 'text' was indexed without offsets, cannot highlight"
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "field 'text' was indexed without offsets, cannot highlight",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "field 'text' was indexed without offsets, cannot highlight"
      }
    }
  },
  "status": 400
}

If I set "index_options": "offsets" in the mapping of the masked field 'stem', highlighting works as expected. However, I'm puzzled as to why the highlighter requires indexing offsets. I'd like to understand why the highlighter doesn't re-analyze the text to calculate offsets dynamically. My concern is that indexing offsets increases the index size, which I want to avoid.

Steps to Reproduce

PUT test_mask
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace"
      },
      "stem": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

PUT test_mask/_doc/1
{
  "text": "a _ a b",
  "stem": "_ b _ _"
}

GET test_mask/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "text": {
              "value": "a"
            }
          }
        },
        {
          "span_field_masking": {
            "field": "text", 
            "query": {
              "span_term": {
                "stem": {
                  "value": "b"
                }
              }
            }
          }
        }
      ],
      "slop": 0,
      "in_order": true
    }
  },
  "highlight": {
    "pre_tags": "(", 
    "post_tags": ")", 
    "fields": {
      "*": {}
    },
    "type": "unified"
  }
}

Expected result

I was expecting the highlight to look like this:

"highlight": {
  "text": [
    "(a) (_) a b"
  ]
}

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-search (Team:Search)

benwtrent commented 11 months ago

This is due to highlight.weight_matches_mode.enabled. I am not 100% sure why we are trying to get the offsets here.

To work around this bug, disable that setting:

PUT test_mask/_settings
{
  "index" : {
    "highlight.weight_matches_mode.enabled" : "false"
  }
}

I still need to dig into the correct fix here.

benwtrent commented 11 months ago

error-trace:

java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight
  at org.apache.lucene.highlighter@9.8.0/org.apache.lucene.search.uhighlight.FieldHighlighter.highlightOffsetsEnums(FieldHighlighter.java:157)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.lucene.search.uhighlight.CustomFieldHighlighter.highlightOffsetsEnums(CustomFieldHighlighter.java:106)
  at org.apache.lucene.highlighter@9.8.0/org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:83)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.lucene.search.uhighlight.CustomFieldHighlighter.highlightFieldForDoc(CustomFieldHighlighter.java:63)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.lucene.search.uhighlight.CustomUnifiedHighlighter.highlightField(CustomUnifiedHighlighter.java:148)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.subphase.highlight.DefaultHighlighter.highlight(DefaultHighlighter.java:81)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.subphase.highlight.HighlightPhase$1.process(HighlightPhase.java:69)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.FetchPhase$1.nextDoc(FetchPhase.java:163)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.FetchPhaseDocsIterator.iterate(FetchPhaseDocsIterator.java:70)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.FetchPhase.buildSearchHits(FetchPhase.java:169)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:78)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:711)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:682)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:543)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$2.accept(ActionRunnable.java:51)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$2.accept(ActionRunnable.java:48)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:73)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
  at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.base/java.lang.Thread.run(Thread.java:1583)

ahoogol commented 11 months ago

@benwtrent Thank you for your suggestion. After running your suggested command, the error no longer occurs. However, the generated highlight doesn't match my expected output.

With your command:

"highlight": {
  "text": [
    "(a) _ (a) b"
  ],
  "stem": [
    "_ (b) _ _"
  ]
}

I was expecting the highlight to look like this:

"highlight": {
  "text": [
    "(a) (_) a b"
  ]
}

Is there a way to achieve this expected result while avoiding the error?

benwtrent commented 11 months ago

@ahoogol turn on offsets for the fields and use "highlight.weight_matches_mode.enabled" : "true"
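
A minimal sketch of that suggestion, written as Python dictionaries that serialize to the two request bodies. Several things here are assumptions not stated in the thread: the index name test_mask_offsets and the helper text_field are hypothetical, offsets are enabled on both fields, and the index is created fresh on the assumption that index_options is not changeable on an existing field.

```python
import json


def text_field(analyzer="whitespace", index_offsets=True):
    """Build a text field mapping, optionally indexing offsets in postings."""
    field = {"type": "text", "analyzer": analyzer}
    if index_offsets:
        # Storing offsets enlarges the index, which is the concern raised above.
        field["index_options"] = "offsets"
    return field


# Body for: PUT test_mask_offsets  (hypothetical index name)
create_index_body = {
    "mappings": {
        "properties": {
            "text": text_field(),
            "stem": text_field(),
        }
    }
}

# Body for: PUT test_mask_offsets/_settings
settings_body = {
    "index": {"highlight.weight_matches_mode.enabled": "true"}
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(settings_body, indent=2))
```

The printed JSON can be pasted into the console after the corresponding PUT lines. Calling text_field(index_offsets=False) yields the original mapping from the reproduction steps, which makes it easy to compare index size with and without offsets.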

ahoogol commented 11 months ago

Thank you for your suggestion, @benwtrent. Yes, it highlights correctly when offsets are enabled. However, I am still concerned about the increase in index size. I'm exploring alternative approaches that produce the desired highlight without turning on offsets, so that the index size stays manageable. Any further insights or suggestions would be greatly appreciated.

mayya-sharipova commented 10 months ago

@ahoogol If you use "require_field_match" : false as a highlighter option, you will get expected results without enabling offsets.

"highlight": {
  "require_field_match": false,
  "pre_tags": "(",
  "post_tags": ")",
  "fields": {
    "*": {}
  },
  "type": "unified"
}

It breaks because internally we check that the field being highlighted ("text") is the same as the field that has the matches ("stem"), but in this case they are different. That is the cause of the failure.

mayya-sharipova commented 10 months ago

I will add this to the documentation for the span_field_masking query and will close this issue.

ahoogol commented 10 months ago

@mayya-sharipova I included "require_field_match": false in the highlighter options, but the resulting output remains different from what I expected:

Output with your suggestion (I tested it in 8.10.0 and 8.11.3):

"highlight": {
  "text": [
    "a _ (a) (b)"
  ]
}

Expected output:

"highlight": {
  "text": [
    "(a) (_) a b"
  ]
}

mayya-sharipova commented 10 months ago

@ahoogol You are indeed right about the expected behaviour, but it is not supported for the span_field_masking query, and it would not be easy to support (without indexing offsets).

The highlighting behaviour that you expect is based on Matches and was added in 8.10. However, it relies on the highlighted field containing the query terms, which is not the case here.
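
To make this concrete, here is a toy model in Python of the positional matching in the reproduction. It is not Elasticsearch or Lucene code: the function names are hypothetical, tokens are produced by a plain whitespace split, and span_near with slop 0 and in_order true is modelled simply as "second clause one position after the first".

```python
# Toy model of positional span matching; not Elasticsearch or Lucene code.

def positions(tokens, term):
    """Return the token positions at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def adjacent_in_order(first_positions, second_positions):
    """Model span_near with slop 0 and in_order true: the second clause
    must sit exactly one position after the first."""
    later = set(second_positions)
    return [(p, p + 1) for p in first_positions if p + 1 in later]


def highlight(tokens, marked, pre="(", post=")"):
    """Wrap the tokens at the marked positions in pre/post tags."""
    return " ".join(
        f"{pre}{tok}{post}" if i in marked else tok
        for i, tok in enumerate(tokens)
    )


# Whitespace analysis of the document from the reproduction steps.
text = "a _ a b".split()
stem = "_ b _ _".split()

# "a" is looked up in `text`; "b" is looked up in `stem`, masked as `text`.
spans = adjacent_in_order(positions(text, "a"), positions(stem, "b"))
marked = {p for span in spans for p in span}

print(spans)                    # [(0, 1)]
print(highlight(text, marked))  # (a) (_) a b
print(text[1] == "b")           # False
```

The only match covers positions 0 and 1, which gives the expected "(a) (_) a b" when projected onto the text field. Position 1 of text holds "_" rather than the query term "b", so a highlighter that locates matches by looking for query terms in the highlighted field has nothing to find there.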


I have added documentation clarifying that the span_field_masking query can produce unexpected highlighting and should be used with require_field_match = false.

I also changed the type of this issue to "feature", which we may tackle sometime in the future.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)