elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Match query parser doesn't produce the expected number of tokens with length filter #59348

Open PengYi-Elastic opened 4 years ago

PengYi-Elastic commented 4 years ago

Elasticsearch version (bin/elasticsearch --version): 7.7

Plugins installed: [analysis-kuromoji]

Description of the problem including expected versus actual behavior: When the length token filter is configured with min set to 2, some matches are not highlighted properly. On closer inspection, the _analyze call for the query text and the _validate/query call for the same match query produce different tokens.

Steps to reproduce:

Example 1: Set min:2 for the length token filter. Highlighting fails.

Create mapping & documents:

PUT synonym_graph_test
{
  "settings": {
    "analysis" : {
      "analyzer" : {
        "search_kuromoji_tokenizer" : {
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : [
            "graph_synonyms",
            "length"
          ]
        },
        "index_kuromoji_tokenizer" : {
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : ["length"]
        }
      },
      "filter" : {
        "graph_synonyms" : {
          "type" : "synonym_graph",
          "synonyms" : ["1978年,昭和53年"]
        },
        "length":  {
            "min":  "2",
            "type":  "length",
            "max":  "256"
          }
      }
    }
  },  
  "mappings": {
    "properties":  {
      "message": {
        "doc_values":  "false",
        "search_analyzer":  "search_kuromoji_tokenizer",
        "type":  "text",
        "analyzer":  "index_kuromoji_tokenizer",
        "term_vector":  "with_positions_offsets"
      }
    }
  }
}

PUT synonym_graph_test/_bulk
{"index":{"_id":"1"}}
{"message":"昭和53年 1978年"}
{"index":{"_id":"2"}}
{"message":"昭和53年"}
{"index":{"_id":"3"}}
{"message":"1978年"}

Run query:

GET synonym_graph_test/_search
{
  "query": {
    "match": {
      "message": "1978年"
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh"
      }
    }
  }
}

Results:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.59086174,
    "hits" : [
      {
        "_index" : "synonym_graph_test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.59086174,
        "_source" : {
          "message" : "1978年"
        },
        "highlight" : {
          "message" : [
            "<em>1978</em>年"
          ]
        }
      },
      {
        "_index" : "synonym_graph_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.39019167,
        "_source" : {
          "message" : "昭和53年 1978年"
        },
        "highlight" : {
          "message" : [
            "昭和53年 <em>1978</em>年"
          ]
        }
      }
    ]
  }
}

In this case, 昭和53 is also expected to be highlighted. Note that document 2 (昭和53年) is missing from the hits entirely, although it matches in Example 2.

Example 2: Set min:1 for the length token filter. Highlighting works as expected.

Create mapping & documents:

PUT synonym_graph_test
{
  "settings": {
    "analysis" : {
      "analyzer" : {
        "search_kuromoji_tokenizer" : {
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : [
            "graph_synonyms",
            "length"
          ]
        },
        "index_kuromoji_tokenizer" : {
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : ["length"]
        }
      },
      "filter" : {
        "graph_synonyms" : {
          "type" : "synonym_graph",
          "synonyms" : ["1978年,昭和53年"]
        },
        "length":  {
            "min":  "1",
            "type":  "length",
            "max":  "256"
          }
      }
    }
  },  
  "mappings": {
    "properties":  {
      "message": {
        "doc_values":  "false",
        "search_analyzer":  "search_kuromoji_tokenizer",
        "type":  "text",
        "analyzer":  "index_kuromoji_tokenizer",
        "term_vector":  "with_positions_offsets"
      }
    }
  }
}

PUT synonym_graph_test/_bulk
{"index":{"_id":"1"}}
{"message":"昭和53年 1978年"}
{"index":{"_id":"2"}}
{"message":"昭和53年"}
{"index":{"_id":"3"}}
{"message":"1978年"}

Run query:

GET synonym_graph_test/_search
{
  "query": {
    "match": {
      "message": "1978年"
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "fvh"
      }
    }
  }
}

Results:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.3922876,
    "hits" : [
      {
        "_index" : "synonym_graph_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.3922876,
        "_source" : {
          "message" : "昭和53年 1978年"
        },
        "highlight" : {
          "message" : [
            "<em>昭和53年</em> <em>1978年</em>"
          ]
        }
      },
      {
        "_index" : "synonym_graph_test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.1193295,
        "_source" : {
          "message" : "昭和53年"
        },
        "highlight" : {
          "message" : [
            "<em>昭和53年</em>"
          ]
        }
      },
      {
        "_index" : "synonym_graph_test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.721618,
        "_source" : {
          "message" : "1978年"
        },
        "highlight" : {
          "message" : [
            "<em>1978年</em>"
          ]
        }
      }
    ]
  }
}

Because no tokens are dropped with min:1, all three documents match and every term is highlighted as expected.
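
The two examples differ only in the min value of the length filter. As a minimal sketch, a small Python helper can build both index-creation bodies from one template, which makes the single varying setting explicit. The function name build_index_body is hypothetical; the settings are copied from the requests above, except that numbers and booleans are written as native JSON types rather than strings. The helper only builds JSON and does not contact a cluster.

```python
import json


def build_index_body(min_length: int) -> dict:
    """Return the index-creation body from the examples, varying only `min`."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "search_kuromoji_tokenizer": {
                        "tokenizer": "kuromoji_tokenizer",
                        "filter": ["graph_synonyms", "length"],
                    },
                    "index_kuromoji_tokenizer": {
                        "tokenizer": "kuromoji_tokenizer",
                        "filter": ["length"],
                    },
                },
                "filter": {
                    "graph_synonyms": {
                        "type": "synonym_graph",
                        "synonyms": ["1978年,昭和53年"],
                    },
                    # The only setting that differs between Example 1 and 2.
                    "length": {"type": "length", "min": min_length, "max": 256},
                },
            }
        },
        "mappings": {
            "properties": {
                "message": {
                    "type": "text",
                    "doc_values": False,
                    "analyzer": "index_kuromoji_tokenizer",
                    "search_analyzer": "search_kuromoji_tokenizer",
                    "term_vector": "with_positions_offsets",
                }
            }
        },
    }


if __name__ == "__main__":
    # Example 1 uses min=2 (highlighting fails); Example 2 uses min=1.
    for min_length in (2, 1):
        print(json.dumps(build_index_body(min_length), ensure_ascii=False, indent=2))
```

Each printed body can be pasted after PUT synonym_graph_test to recreate the corresponding example.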

With min set to 2, the _analyze call returns the following tokens:

POST synonym_graph_test/_analyze
{
  "analyzer" : "search_kuromoji_tokenizer",
  "text" : "1978年"
}

Response:

{
  "tokens" : [
    {
      "token" : "昭和",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "1978",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "53",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

This should produce a boolean query along the lines of message:(昭和 53) OR message:1978. However, a _validate/query call returns this:

GET synonym_graph_test/_validate/query?explain=true
{
  "query": {
    "match": {
      "message": { "query" : "1978年"}
    }
  }
}

Response:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "synonym_graph_test",
      "valid" : true,
      "explanation" : "message:1978"
    }
  ]
}
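
The mismatch between the two responses can be checked mechanically. The following Python sketch hard-codes the token texts from the _analyze response and the explanation string from the _validate/query response, then lists the analyzed tokens that never appear in the generated query. The helper names are hypothetical, and the regular expression is a naive parser that only handles simple field:term explanations, not phrases or nested clauses.

```python
import re

# Token texts reported by the _analyze call for "1978年" with min set to 2.
analyze_tokens = ["昭和", "1978", "53"]

# Explanation string returned by the _validate/query call.
explanation = "message:1978"


def terms_in_explanation(explanation: str, field: str) -> set:
    """Extract the terms that appear as `field:term` in an explanation string."""
    return set(re.findall(re.escape(field) + r":(\w+)", explanation))


def missing_terms(tokens, explanation, field):
    """Return analyzed tokens that never show up in the generated query."""
    present = terms_in_explanation(explanation, field)
    return [t for t in tokens if t not in present]


if __name__ == "__main__":
    print(missing_terms(analyze_tokens, explanation, "message"))
    # prints ['昭和', '53']
```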

The only term in the query is 1978; the clause for the synonym path (昭和 53) is not generated.
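
One possible reading of this result, offered as a hypothesis rather than a confirmed diagnosis, is that removing the single-character token 年 leaves the synonym side path without a connection to the end of the token graph. The Python sketch below is a toy model, not Lucene's query-building code. The positions of 昭和, 1978 and 53 come from the _analyze output above; the two 年 tokens and their position lengths are assumptions, since the issue only shows the tokens that survive the filter.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int         # start node in the token graph
    position_length: int  # number of nodes the token spans


def length_filter(tokens: List[Token], min_len: int) -> List[Token]:
    """Drop tokens shorter than min_len, leaving positions untouched."""
    return [t for t in tokens if len(t.term) >= min_len]


def paths_to_end(tokens: List[Token]) -> List[List[str]]:
    """Enumerate term sequences leading from node 0 to the last node reached."""
    if not tokens:
        return []
    end = max(t.position + t.position_length for t in tokens)
    results: List[List[str]] = []

    def walk(node: int, terms: List[str]) -> None:
        if node == end:
            results.append(terms)
            return
        for t in tokens:
            if t.position == node:
                walk(node + t.position_length, terms + [t.term])

    walk(0, [])
    return results


# ASSUMPTION: the token graph before the length filter runs.
full_graph = [
    Token("昭和", 0, 1),
    Token("1978", 0, 3),
    Token("53", 1, 1),
    Token("年", 2, 2),  # assumed last token of the synonym path
    Token("年", 3, 1),  # assumed original token after 1978
]

if __name__ == "__main__":
    print(paths_to_end(length_filter(full_graph, 1)))
    # prints [['昭和', '53', '年'], ['1978', '年']]
    print(paths_to_end(length_filter(full_graph, 2)))
    # prints [['1978']]
```

In this model, min set to 2 leaves 昭和 and 53 in the stream, but their path stops at node 2 while 1978 reaches node 3, so only 1978 forms a complete path. That is consistent with the message:1978 explanation, though the actual cause in the query parser may differ.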

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Analysis)

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)