elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

match_phrase_prefix is not accurate when searching for keywords that are a combination of numbers and letters #111387

Closed piemon-nyah closed 1 month ago

piemon-nyah commented 2 months ago

Elasticsearch Version

7.17.3

Installed Plugins

No response

Java Version

bundled

OS Version

AnolisOS Linux 7.9

Problem Description

I use the word_delimiter_graph filter to split tokens at the boundaries between letters and numbers. However, when I use match_phrase_prefix to search for a letter-and-number combination such as df337, I can't find the results I expect.

For example, after indexing a document containing DF33760BF_X4, I can find it with df and df337, but not with df33.

This problem does not occur when the index is small (a single index with 3 million records, 1.5 GB). It does occur when the index is large (a single index with 30 million records, 15 GB).

Steps to Reproduce

index setting

PUT v2-vmail-inbox-local-000001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "word_delimiter_graph",
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mail_subject": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

demo-data

POST v2-vmail-inbox-local-000001/_doc/1111-125830ed339e4e49ad5ffa84d7ee6d08?routing=1111
{
  "receiver_code": "1111",
  "mail_subject": "阅读通知:DF33760BF_X4",
  "receive_time": 1651802115000
}

search-dsl

GET v2-vmail-inbox-local-000001/_search?routing=1111
{
  "sort": [
    {
      "receive_time": {
        "order": "desc"
      }
    }
  ],
  "size": 50,
  "_source": ["mail_subject"],
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "receiver_code": "1111"
          }
        },
        {
          "multi_match": {
            "query": "df337",
            "type": "phrase_prefix",
            "fields": ["sender_name", "mail_subject"]
          }
        }
      ]
    }
  }
}

Both df and df337 find the document. df33 does not.

Logs (if relevant)

I used the _analyze API to check the tokens. df and 33760 are produced as expected.

GET v2-vmail-inbox-local-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "阅读通知:DF33760BF_X4"
}


I also tried setting preserve_original=true and adjust_offsets=false, and I tried a custom char_filter that separates letters from numbers:

{
  "analyzer": {
    "my_analyzer": {
      "char_filter": [
        "custom_char_filter"
      ],
      "filter": [
        "lowercase"
      ],
      "tokenizer": "standard"
    }
  },
  "char_filter": {
    "custom_char_filter": {
      "type": "pattern_replace",
      "pattern": "(\\d)([a-zA-Z])|([a-zA-Z])(\\d)",
      "replacement": "$1$3 $2$4"
    }
  }
}

But the result is the same: I still can't find the document by searching for df33.
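One way to narrow this down is to analyze the search string itself rather than the indexed text. The request below is a minimal sketch that reuses the index and analyzer names from the reproduction steps. Assuming the analyzer behaves at query time as it does at index time, df33 would presumably come back as two tokens, df and 33, which would mean the phrase prefix has to expand 33 rather than df33.

```
# Sketch: analyze the query string, not the document text.
# Index and analyzer names are taken from the reproduction steps above.
GET v2-vmail-inbox-local-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "df33"
}
```

If the last token is a short number like 33, a large index is likely to contain many terms with that prefix, which would fit the observation that the problem only shows up at 30 million records.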

piemon-nyah commented 2 months ago

After using profile, I found that the term condition and the multi_match are independent queries. Even if there is data in df33, it will not match because max_expansions is only 50 by default. Is there a way to prioritize the term condition and then apply match_phrase_prefix, to ensure that I get the result I want?
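For reference, max_expansions can be set directly on the multi_match clause. The sketch below reuses the query from the reproduction steps with df33 as the search string. The value 1000 is an arbitrary illustration, not a recommendation: larger values make the prefix expansion more expensive. The limit caps how many indexed terms the final prefix is expanded to, independently of the receiver_code filter, so on a large index even a raised limit may still miss the wanted term.

```
# Sketch only: 1000 is an assumed value chosen for illustration.
GET v2-vmail-inbox-local-000001/_search?routing=1111
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "receiver_code": "1111" } },
        {
          "multi_match": {
            "query": "df33",
            "type": "phrase_prefix",
            "fields": ["sender_name", "mail_subject"],
            "max_expansions": 1000
          }
        }
      ]
    }
  }
}
```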

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

carlosdelest commented 1 month ago

Hi @piemon-nyah !

Even if there is data in df33, it will not match because max_expansions is only 50 by default.

That seems to be the reason why there's no data coming back. I don't think this is a bug so I'm closing this issue. In case you need more support you can check in our forums.

May I suggest using a different analyzer for this search? You could use multi-fields to add a search-as-you-type sub-field for filtering purposes.

piemon-nyah commented 1 month ago

@carlosdelest I know it will not match because max_expansions is only 50 by default. Is there a way to prioritize the term condition and then apply match_phrase_prefix, to ensure that I get the result I want?

carlosdelest commented 1 month ago

Hi @piemon-nyah ! I don't think there's any. Filters are not applied sequentially, and terms expansion will work independently of the other filters applied.

I believe using a different analyzer / different field for filtering purposes should be the way to go here.
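A rough sketch of what the multi-field suggestion could look like follows. The index name v2-vmail-inbox-local-000002 and the sub-field name sayt are hypothetical, and the custom analysis settings are left out for brevity, so both fields fall back to the default analyzer here. Existing documents would need to be reindexed before the sub-field is populated.

```
# Hypothetical index name and sub-field name ("sayt"); adjust to the real mapping.
PUT v2-vmail-inbox-local-000002
{
  "mappings": {
    "properties": {
      "mail_subject": {
        "type": "text",
        "fields": {
          "sayt": { "type": "search_as_you_type" }
        }
      }
    }
  }
}

# Prefix search against the sub-field and its shingle sub-fields.
GET v2-vmail-inbox-local-000002/_search?routing=1111
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "receiver_code": "1111" } },
        {
          "multi_match": {
            "query": "df33",
            "type": "bool_prefix",
            "fields": [
              "mail_subject.sayt",
              "mail_subject.sayt._2gram",
              "mail_subject.sayt._3gram"
            ]
          }
        }
      ]
    }
  }
}
```

A search_as_you_type field indexes term prefixes up front, so prefix matching on it does not depend on expanding the term dictionary at query time. With the default analyzer, DF33760BF_X4 is likely kept as one lowercased token, so this sketch targets prefixes of the whole code (df, df33, df337) rather than inner parts such as 33760. Whether that trade-off is acceptable depends on the search requirements.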

piemon-nyah commented 1 month ago

@carlosdelest Okay, thank you.