elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Highlight field query does not contain highlight(s) after migration to 7.6.2 #70922

Closed seco-mgabor closed 3 years ago

seco-mgabor commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.6.2

Plugins installed: []

JVM version (java -version): 1.8

OS version (uname -a if on a Unix-like system): Ubuntu 5.4.0-70-generic, Elastic running as container

Description of the problem including expected versus actual behavior: I'm trying to migrate from Elastic 6.6.2 to 7.6.2, and queries containing highlight fields no longer work. The same query returns results containing highlighted values in 6.6.2, but the results contain no highlight field at all in 7.6.2. To me it looks like a regression bug.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query etc. The easier you make for us to reproduce it, the more likely that somebody will take the time to look at it.

Mapping is the following:

{
  "candidate-profiles" : {
    "mappings" : {
      "properties" : {
        "_class" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "availability" : {
          "type" : "keyword"
        },
        "availabilityEndDate" : {
          "type" : "date",
          "format" : "date"
        },
        "drivingCategories" : {
          "type" : "keyword"
        },
        "externalId" : {
          "type" : "keyword"
        },
        "gender" : {
          "type" : "keyword"
        },
        "geoPoint" : {
          "type" : "geo_point"
        },
        "highestDegree" : {
          "type" : "text"
        },
        "highestEducationLevel" : {
          "type" : "integer"
        },
        "id" : {
          "type" : "keyword"
        },
        "jobExperiences" : {
          "type" : "nested",
          "properties" : {
            "degree" : {
              "type" : "integer"
            },
            "education" : {
              "type" : "integer"
            },
            "experience" : {
              "type" : "integer"
            },
            "graduation" : {
              "type" : "keyword"
            },
            "isLastJob" : {
              "type" : "boolean"
            },
            "occupation" : {
              "properties" : {
                "avamCode" : {
                  "type" : "long"
                },
                "bfsCode" : {
                  "type" : "long"
                }
              }
            },
            "remark" : {
              "type" : "search_as_you_type",
              "analyzer" : "ascii_folding",
              "max_shingle_size" : 3
            },
            "wanted" : {
              "type" : "boolean"
            }
          }
        },
        "languages" : {
          "type" : "nested",
          "properties" : {
            "code" : {
              "type" : "text",
              "analyzer" : "language_iso_code_synonym_analyzer"
            },
            "spokenLevel" : {
              "type" : "integer"
            },
            "writtenLevel" : {
              "type" : "integer"
            }
          }
        },
        "preferredWorkCantons" : {
          "type" : "keyword"
        },
        "preferredWorkRegions" : {
          "type" : "keyword"
        },
        "protected" : {
          "type" : "boolean"
        },
        "public" : {
          "type" : "boolean"
        },
        "residenceCantonCode" : {
          "type" : "keyword"
        },
        "showProtectedData" : {
          "type" : "boolean"
        },
        "workForms" : {
          "type" : "keyword"
        },
        "workLoad" : {
          "type" : "integer"
        }
      }
    }
  }
}

Sample data:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "candidate-profiles",
        "_type" : "_doc",
        "_id" : "57070211-81ac-4b2f-80ac-491617d78277",
        "_score" : 1.0,
        "_source" : {
          "_class" : "ch.admin.seco.service.candidate.domain.candidate.elasticsearch.CandidateProfileDocument",
          "id" : "57070211-81ac-4b2f-80ac-491617d78277",
          "externalId" : "plrggAwm7UudpyOwutb0oHLCbI8EzueE",
          "public" : true,
          "protected" : false,
          "showProtectedData" : false,
          "gender" : "MALE",
          "availability" : "IMMEDIATE",
          "residenceCantonCode" : "en",
          "workForms" : [
            "SHIFT_WORK",
            "HOME_WORK"
          ],
          "preferredWorkRegions" : [
            "zr",
            "xr"
          ],
          "preferredWorkCantons" : [
            "gr"
          ],
          "jobExperiences" : [
            {
              "occupation" : {
                "avamCode" : 12,
                "bfsCode" : 1112
              },
              "experience" : 3,
              "graduation" : "NONE",
              "degree" : 12,
              "education" : 4,
              "remark" : "Java developer",
              "isLastJob" : false,
              "wanted" : true
            },
            {
              "occupation" : {
                "avamCode" : 11,
                "bfsCode" : 1111
              },
              "experience" : 3,
              "graduation" : "NONE",
              "degree" : 12,
              "education" : 4,
              "remark" : "angular",
              "isLastJob" : true,
              "wanted" : true
            }
          ],
          "languages" : [
            {
              "code" : "en",
              "writtenLevel" : 2,
              "spokenLevel" : 3
            }
          ],
          "drivingCategories" : [
            "D",
            "B"
          ],
          "highestEducationLevel" : 4,
          "highestDegree" : "TER_BACHELOR_UNIVERSITAET",
          "workLoad" : 0,
          "availabilityEndDate" : "2021-04-10",
          "geoPoint" : {
            "lat" : 9.25,
            "lon" : 7.12
          }
        }
      },
      {
        "_index" : "candidate-profiles",
        "_type" : "_doc",
        "_id" : "fbbb486b-6116-459a-8dd7-6f1abb847155",
        "_score" : 1.0,
        "_source" : {
          "_class" : "ch.admin.seco.service.candidate.domain.candidate.elasticsearch.CandidateProfileDocument",
          "id" : "fbbb486b-6116-459a-8dd7-6f1abb847155",
          "externalId" : "xllDWQLiHb30vL0XrBy87MZsMHMF8AsZ",
          "public" : true,
          "protected" : false,
          "showProtectedData" : false,
          "gender" : "MALE",
          "availability" : "IMMEDIATE",
          "residenceCantonCode" : "en",
          "workForms" : [
            "SHIFT_WORK",
            "HOME_WORK"
          ],
          "preferredWorkRegions" : [
            "zr",
            "xr"
          ],
          "preferredWorkCantons" : [
            "gr"
          ],
          "jobExperiences" : [
            {
              "occupation" : {
                "avamCode" : 12,
                "bfsCode" : 1112
              },
              "experience" : 3,
              "graduation" : "NONE",
              "degree" : 12,
              "education" : 4,
              "remark" : "Java software developer with angular experience",
              "isLastJob" : true,
              "wanted" : true
            }
          ],
          "languages" : [
            {
              "code" : "en",
              "writtenLevel" : 2,
              "spokenLevel" : 3
            }
          ],
          "drivingCategories" : [
            "D",
            "B"
          ],
          "highestEducationLevel" : 4,
          "highestDegree" : "TER_BACHELOR_UNIVERSITAET",
          "workLoad" : 0,
          "availabilityEndDate" : "2021-04-10",
          "geoPoint" : {
            "lat" : 9.25,
            "lon" : 7.12
          }
        }
      },
      {
        "_index" : "candidate-profiles",
        "_type" : "_doc",
        "_id" : "3c0a9098-93dc-45fc-87ea-73eeea8cc66c",
        "_score" : 1.0,
        "_source" : {
          "_class" : "ch.admin.seco.service.candidate.domain.candidate.elasticsearch.CandidateProfileDocument",
          "id" : "3c0a9098-93dc-45fc-87ea-73eeea8cc66c",
          "externalId" : "O5kcwg5G43jTh6a38UqlNM7DCPa3aUIr",
          "public" : true,
          "protected" : false,
          "showProtectedData" : false,
          "gender" : "MALE",
          "availability" : "IMMEDIATE",
          "residenceCantonCode" : "en",
          "workForms" : [
            "SHIFT_WORK",
            "HOME_WORK"
          ],
          "preferredWorkRegions" : [
            "zr",
            "xr"
          ],
          "preferredWorkCantons" : [
            "gr"
          ],
          "jobExperiences" : [
            {
              "occupation" : {
                "avamCode" : 13,
                "bfsCode" : 1113
              },
              "experience" : 3,
              "graduation" : "NONE",
              "degree" : 12,
              "education" : 4,
              "remark" : "Objective-C/Swift developer",
              "isLastJob" : true,
              "wanted" : true
            }
          ],
          "languages" : [
            {
              "code" : "en",
              "writtenLevel" : 2,
              "spokenLevel" : 3
            }
          ],
          "drivingCategories" : [
            "D",
            "B"
          ],
          "highestEducationLevel" : 4,
          "highestDegree" : "TER_BACHELOR_UNIVERSITAET",
          "workLoad" : 0,
          "availabilityEndDate" : "2021-04-10",
          "geoPoint" : {
            "lat" : 9.25,
            "lon" : 7.12
          }
        }
      }
    ]
  }
}

Here is the query, as I have it right now:


{
  "highlight": {
    "fields": {
      "jobExperiences.remark": {
        "fragment_size": 300,
        "number_of_fragments": 10
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "bool": {
                "must": [
                  {
                    "match_phrase_prefix": {
                      "jobExperiences.remark": {
                        "query": "developer",
                        "slop": 0,
                        "max_expansions": 50,
                        "boost": 1
                      }
                    }
                  },
                  {
                    "term": {
                      "jobExperiences.wanted": {
                        "value": true,
                        "boost": 1
                      }
                    }
                  }
                ],
                "adjust_pure_negative": true,
                "boost": 1
              }
            },
            "path": "jobExperiences",
            "ignore_unmapped": false,
            "score_mode": "avg",
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

1. Build a query for 6.6.2 with a highlight field.
2. The query results contain highlighted values.
3. Run the exact same query against 7.6.2. The query returns the correct results, but without any highlights at all.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

seco-mgabor commented 3 years ago

Hi there, I would like to know how long it usually takes until analysis of a ticket gets started. Thanks!

jtibshirani commented 3 years ago

@seco-mgabor would you be able to add more details to the reproduction, including a mapping and example document? That helps ensure we fully understand the bug you're running into.

As an initial note, I tried to reproduce on my own, and suspect the problem is with match_phrase_prefix. If I change this to match_phrase, then we return highlights (given the query still matches).
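
The swap described above can be sketched as follows, for diagnosis only, since match_phrase does no prefix matching and is not a drop-in replacement. This is a minimal sketch that only rewrites the request body; the helper name `to_match_phrase` is hypothetical, and the clause is copied from the query in the report. It drops `max_expansions` on the assumption that the option only applies to prefix expansion.

```python
# Clause copied from the query in the original report.
prefix_clause = {
    "match_phrase_prefix": {
        "jobExperiences.remark": {
            "query": "developer",
            "slop": 0,
            "max_expansions": 50,
            "boost": 1,
        }
    }
}


def to_match_phrase(clause):
    """Hypothetical helper: rewrite a match_phrase_prefix clause as match_phrase."""
    # The clause holds exactly one field name mapped to its options.
    field, options = next(iter(clause["match_phrase_prefix"].items()))
    # max_expansions only controls prefix expansion, so it is dropped here.
    kept = {key: value for key, value in options.items() if key != "max_expansions"}
    return {"match_phrase": {field: kept}}


phrase_clause = to_match_phrase(prefix_clause)
print(phrase_clause)
```

If highlights come back with the rewritten clause while the query still matches, that points at match_phrase_prefix rather than at the mapping or the highlight settings.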

seco-mgabor commented 3 years ago

@jtibshirani thank you for your answer. Please let me try your suggestion first. :) As you've suggested, this solution works. A million thanks! But how should match_phrase_prefix be replaced? I mean, match_phrase_prefix will find me Angular JS if I search for angular, but match_phrase won't. Also, I need an explanation why it worked before and stopped working with the upgrade to 7.6.2?

seco-mgabor commented 3 years ago

What I have tried too was search_as_you_type but that doesn't work with highlighting either. :(

seco-mgabor commented 3 years ago

I've provided the mapping and some sample data, I hope someone can take a look at this issue.

jtibshirani commented 3 years ago

Thank you @seco-mgabor for the reproduction steps. I confirmed that this is a bug: the unified highlighter (which is the default) does not give correct highlights on match_phrase_prefix queries. This regression happened in Elasticsearch 7.3; it works in 7.2 and before.

We still need to debug what caused the regression. Until we fix it, the options are to use a different highlighter type like plain or switch to a new query (given your use case, these may not be possible), or to use ES 7.2.
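
A minimal sketch of the first option, assuming the highlight section from the query in the report: the per-field `type` option selects the highlighter, so setting it to `plain` avoids the unified highlighter. The helper name `with_plain_highlighter` is hypothetical, and whether the plain highlighter produces acceptable fragments for this nested field has not been verified here.

```python
import copy
import json

# Highlight section copied from the query in the original report.
highlight = {
    "fields": {
        "jobExperiences.remark": {
            "fragment_size": 300,
            "number_of_fragments": 10,
        }
    }
}


def with_plain_highlighter(highlight_section):
    """Hypothetical helper: return a copy with every field set to type 'plain'."""
    patched = copy.deepcopy(highlight_section)  # leave the caller's dict untouched
    for options in patched["fields"].values():
        options["type"] = "plain"  # other highlighter types: "unified", "fvh"
    return patched


patched_highlight = with_plain_highlighter(highlight)
# Print the section as it would appear in the search request body.
print(json.dumps({"highlight": patched_highlight}, indent=2))
```

The rest of the request (the nested bool query) stays unchanged; only the highlight section differs.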

> But how should match_phrase_prefix be replaced?

Sorry for the confusion, I was not suggesting that match_phrase was a direct replacement for match_phrase_prefix. I was only noting that the bug seems related to match_phrase_prefix in particular.

seco-mgabor commented 3 years ago

Hi @jtibshirani, thanks a lot for your answer. Well, I was asking myself the same questions. First of all, I need to look into what the plain highlighter is, as I am not very experienced with Elastic.

Unfortunately 7.2 is not an option, as it has reached EOL already. :(

Theoretically, we can migrate to SpringBoot 2.4.x, which is compatible with Elastic 7.9.3. But there is already trouble with the migration to 7.6.2, so I honestly doubt that would be any easier.

Just as an aside, according to this, only 7.6.2 is compatible with SpringBoot 2.3.x.

Regards, MG

cbuescher commented 3 years ago

I did some debugging on this and think I found a difference that was introduced shortly after 7.2 while updating to a newer Lucene snapshot. Something in the CustomUnifiedHighlighter's query rewrite logic seems to have changed. I will open a PR shortly to discuss a possible fix there.

seco-mgabor commented 3 years ago

Yeeey!!! Thanks for the good news!

seco-mgabor commented 3 years ago

May I ask in which version this fix will be available?

cbuescher commented 3 years ago

I backported the PR to 7.14 which will be our next minor release.