elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

With the classic tokenizer, a wildcarded query_string of e.g. T.* doesn't respect the dot #23737

Closed mrec closed 7 years ago

mrec commented 7 years ago

Elasticsearch version: 2.3.2

Plugins installed: [analysis-icu 2.3.2, analysis-korean 2.3.2, analysis-kuromoji 2.3.2, analysis-smartcn 2.3.2, analysis-stconvert 1.8.3, delete-by-query 2.3.2, enhanced-highlighter 2.3.2-1.6, enhanced-plain-highlighter 2.3.2-1.7, entsearch 2.3.2-1.6, head master, license 2.3.2, marvel-agent 2.3.2, reindex 2.3.2, repository-hdfs 2.3.1]

JVM version: 1.8.0

OS version: Oracle Linux Server 6.5

Description of the problem including expected versus actual behavior:

We're searching text that can include ticker-style symbols like T.A. There's an element of structure to these, and sometimes users want to wildcard the second part, i.e. T.*. The classic tokenizer (which we use) treats . as a token character when it's not followed by whitespace, so we'd expect this query to match only those indexed terms starting with T. (T followed by a dot). What it actually does is match all terms starting with T, which brings in a huge number of false positives.

This feels wrong: even though the input doesn't contain token characters after the ., the wildcard is clearly standing in for token characters here. You could argue that T.* should also match T. itself (with no characters after the dot), and the classic tokenizer would have dropped the trailing dot there, but this argument doesn't apply to e.g. T.? or T.?*, and those don't work either.

Steps to reproduce:

PUT mrec
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer":{
          "demo":{
            "tokenizer":"classic",
            "filter":["lowercase"]
          }
        }
      }
    }
  }, 
  "mappings": {
    "doc": {
      "properties": {
        "s": {
          "type": "string",
          "analyzer": "demo"
        }
      }
    }
  }
}

PUT mrec/doc/1
{
  "s":"T.A"
}

PUT mrec/doc/2
{
  "s":"trombone"
}

GET mrec/_search
{
  "fielddata_fields": ["s"], 
  "query": {
    "query_string": {
      "query": "s:(T.*)",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}
javanna commented 7 years ago

Hi @mrec, please ask questions like these in the forum instead: https://discuss.elastic.co/. The github issues list is reserved for bug reports and feature requests only. Thanks!

The difference is made by analyze_wildcard. If you set it to false you should get the expected result. That seems to show that the classic tokenizer leaves the . only when it's between tokens. Have a look at the analyze api output for the following requests:

GET /mrec/_analyze?text=T.A&tokenizer=classic
# the . is kept

GET /mrec/_analyze?text=T.&tokenizer=classic
# the . is dropped

The former shows how your documents get analyzed, while the latter shows how the T.* token which contains a wildcard gets analyzed (only if you set analyze_wildcard to true).

Also, looking at the explain output for your query would have shown the problem, as the rewritten query doesn't contain the dot.
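For reference, one way to inspect the rewritten query is the validate API (an illustrative request against the index created above; the exact response wording varies by version):

GET mrec/_validate/query?explain
{
  "query": {
    "query_string": {
      "query": "s:(T.*)",
      "analyze_wildcard": true
    }
  }
}

The explanation in the response should show the wildcard term the query was rewritten to, making the missing dot visible.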

mrec commented 7 years ago

Yes, I understand that the classic tokenizer is dropping a . in suffix position but not in infix position; that's described in the doc I linked. My point is that at least for T.?*, and probably for T.*, the caller is very obviously wanting to match an indexed term where the . is not the last character, and therefore the . before the wildcard should not be treated as the last character and should not be dropped when analyzing the query.

I understand that turning off analyze_wildcard would solve this specific case, but would almost certainly break other cases; there's a reason that flag exists.

javanna commented 7 years ago

I understand what you mean @mrec . Unfortunately the way analyze_wildcard works is different though. I wasn't suggesting to remove analyze_wildcard, rather to show how it makes the difference. I would look into using a different analyzer that always keeps the .. Would that be an option?
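As an illustration only (not something proposed in the thread), a minimal sketch of an analyzer that preserves dots, built on the whitespace tokenizer, which splits only on whitespace and so keeps both infix and trailing dots. The index and analyzer names here are hypothetical, and note the tradeoff: this tokenizer also keeps other trailing punctuation, so it may not suit a general text field:

PUT mrec_dots
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "keep_dots": {
            "tokenizer": "whitespace",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}

With this analyzer both T.A and T. keep their dots, so an analyzed T.* wildcard would too.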

mrec commented 7 years ago

I would look into using a different analyzer that always keeps the .. Would that be an option?

Unfortunately we're searching against multiple fields with different analyzers. The field we want this wildcard to match is already using a different analyzer designed for symbols, which doesn't have this problem. The false positives are coming from a more general arbitrary-text field which does use the classic tokenizer.

javanna commented 7 years ago

Maybe you could change the query analyzer then, rather than the field analyzer? I'm afraid that with the classic tokenizer and analyze_wildcard you are not going to get what you want.
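For illustration, the query_string query accepts an analyzer parameter that overrides the search-time analyzer. A sketch using the built-in whitespace analyzer, which doesn't strip the trailing dot (again illustrative, against the index from the reproduction steps):

GET mrec/_search
{
  "query": {
    "query_string": {
      "query": "s:(T.*)",
      "analyze_wildcard": true,
      "analyzer": "whitespace"
    }
  }
}

The tradeoff is that the query and index analyzers now disagree, so terms the classic tokenizer altered at index time (e.g. a literal T. indexed as just t) may still not line up with the query terms.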

mrec commented 7 years ago

Maybe, though I'm not sure what we'd change it to. By the way, the standard analyzer tokenizes T.A and T. the same way as the classic tokenizer, and has the exact same problem with wildcards.

It's probably worth us experimenting with turning off analyze_wildcard again and seeing just how bad the problems are and whether they're easier to work around than this is. Ah well. Thanks for the response, anyway; I didn't have high hopes of getting this fixed, but it's good to have a concrete answer to refer to.

javanna commented 7 years ago

Maybe you could open a discuss post, I think it is more appropriate than an issue and others may chime in as well with other ideas.