elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Bug report] minimum_should_match behaves incorrectly #8502

Closed Kamapcuc closed 8 years ago

Kamapcuc commented 9 years ago

How to reproduce the bug:

Analyzer settings:

PUT test
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "myGramAn" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "myGram"
                    ]
                }
            },
            "filter" : {
                "myGram" : {
                    "type" : "nGram",
                    "min_gram" : 3,
                    "max_gram" : 3
                }
            }
        }
    }
}

Mapping:

PUT test/myType/_mapping
{
    "properties" : {
        "myField" : {
            "type" : "string",
            "analyzer" : "myGramAn"
        }
    }
}

Documents:

PUT test/myType/1
{
    "myField" : "12345"
}

PUT test/myType/2
{
    "myField" : "12345678"
}

PUT test/myType/3
{
    "myField" : "123 678"
}

Now I want to find documents that contain all four of these grams: "123", "234", "567", and "678". All four occur only in the second document, so if I use a terms query it returns the right answer:

GET test/myType/_search
{
    "query" : {
        "terms" : {
            "myField" : [ "123", "234", "567", "678" ],
            "minimum_should_match" : "4"
        }
    }
}
//-------------------
{
    "hits" : [
        {
            "_index" : "test",
            "_type" : "myType",
            "_id" : "2",
            "_score" : 0.61370564,
            "_source" : {
                "myField" : "12345678"
            }
        }
    ]
}

But if I change it to a match query, the result is wrong:

GET test/myType/_search
{
    "query" : {
        "match" : {
            "myField" : {
                "query" : "1234 5678",
                "minimum_should_match" : "4"
            }
        }
    }
}
//-------------------
{
    "hits" : [
        {
            "_index" : "test",
            "_type" : "myType",
            "_id" : "2",
            "_score" : 0.61370564,
            "_source" : {
                "myField" : "12345678"
            }
        }, 
        {
            "_index" : "test",
            "_type" : "myType",
            "_id" : "3",
            "_score" : 0.07956372,
            "_source" : {
                "myField" : "123 678"
            }
        }
    ]
}

But the results should be the same, because the match query's text analyzes to exactly the same four grams:

GET test/_analyze?analyzer=myGramAn&text=1234 5678
//-------------------
{
    "tokens" : [{
            "token" : "123",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "word",
            "position" : 1
        }, {
            "token" : "234",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "word",
            "position" : 1
        }, {
            "token" : "567",
            "start_offset" : 5,
            "end_offset" : 9,
            "type" : "word",
            "position" : 2
        }, {
            "token" : "678",
            "start_offset" : 5,
            "end_offset" : 9,
            "type" : "word",
            "position" : 2
        }
    ]
}

The result stays identical even if I change minimum_should_match to "100%", "2", "3", "-1", or "-2".

The root of this problem lies at org.apache.lucene.util.QueryBuilder:287 (Lucene 4.10.1). The createFieldQuery method splits the terms by position increment into two should groups and wraps them in an outer should query that carries our minimum_should_match: (myField:123 myField:234)(myField:567 myField:678). That's why the results stay the same for other values of minimum_should_match. All other kinds of match queries are affected too.
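
In query DSL terms, the match query above is rewritten into something roughly like the following bool query (a sketch of the structure, not the exact Lucene output), so minimum_should_match is applied to the two position groups rather than to the four grams:

GET test/myType/_search
{
    "query" : {
        "bool" : {
            "minimum_should_match" : "4",
            "should" : [
                { "bool" : { "should" : [
                    { "term" : { "myField" : "123" } },
                    { "term" : { "myField" : "234" } }
                ] } },
                { "bool" : { "should" : [
                    { "term" : { "myField" : "567" } },
                    { "term" : { "myField" : "678" } }
                ] } }
            ]
        }
    }
}

Document 3 contains a gram from each group, which is presumably why it shows up alongside document 2.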

clintongormley commented 9 years ago

Hi @Kamapcuc

The problem with changing the current logic is that synonyms, for example, would no longer work as expected. Let's say that you have "jumps" as a synonym for "leaps", using query-time synonym expansion.

A query for "the quick fox jumps" would become "the quick fox (jumps|leaps)". Unless you have also used index-time expansion, this query will never match the indexed docs. Instead, we only require one token in each position.

Token filters are not allowed to change positions or offsets, so the ngram token filter will always produce stacked tokens.
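
To illustrate the synonym case, a query-time synonym setup along these lines (the index and filter names here are just for illustration) stacks "jumps" and "leaps" at the same position, exactly the way your ngram filter stacks its grams:

PUT synonym_test
{
    "settings" : {
        "analysis" : {
            "filter" : {
                "mySynonyms" : {
                    "type" : "synonym",
                    "synonyms" : [ "jumps, leaps" ]
                }
            },
            "analyzer" : {
                "mySynAn" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "mySynonyms" ]
                }
            }
        }
    }
}

Requiring every stacked token at a position would turn "(jumps|leaps)" into "jumps AND leaps", which is exactly what breaks in the synonym case.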

However, you can use the ngram tokenizer to do exactly what you need. In your example, change the analyzer to the following:

PUT test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "myGramTok": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter","digit"
          ]
        }
      },
      "analyzer": {
        "myGramAn": {
          "tokenizer": "myGramTok"
        }
      }
    }
  }
}

With this in place, your match query works as expected.
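
The difference is that the tokenizer emits each gram at its own position, so the match query builds one clause per gram and minimum_should_match is applied across all of them. Reanalyzing the query string should now give output along these lines (abridged; exact positions and offsets may vary by version):

GET test/_analyze?analyzer=myGramAn&text=1234 5678
//-------------------
{
    "tokens" : [
        { "token" : "123", "position" : 1 },
        { "token" : "234", "position" : 2 },
        { "token" : "567", "position" : 3 },
        { "token" : "678", "position" : 4 }
    ]
}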

Kamapcuc commented 9 years ago

Hi, Clinton!

Of course I understand that QueryBuilder:287 is there for synonyms (we use them, and they are awesome). But that doesn't change the fact that this behaviour of minimum_should_match is a bug: it simply does not work as intended, and the documentation says nothing about it not working with grams. (It took me several hours to find out why our big, complicated, multi_match-based query sometimes returns very strange results.)

Your solution is not applicable for us, because we use the snowball token filter. Also, I can't drop n-grams entirely, because when I search for "conductor" I want to find "semiconductor" too.

Maybe there is some option to remove TF/IDF and replace it with a constant? For example, if each matching term added exactly 1 to the score, I could apply the cutoff with min_score instead.
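
Something like the following is what I have in mind (just a sketch: each gram wrapped in constant_score so a matching clause adds exactly 1, disable_coord so the coord factor doesn't rescale the sum, and min_score as the cutoff):

GET test/myType/_search
{
    "min_score" : 4,
    "query" : {
        "bool" : {
            "disable_coord" : true,
            "should" : [
                { "constant_score" : { "query" : { "term" : { "myField" : "123" } }, "boost" : 1 } },
                { "constant_score" : { "query" : { "term" : { "myField" : "234" } }, "boost" : 1 } },
                { "constant_score" : { "query" : { "term" : { "myField" : "567" } }, "boost" : 1 } },
                { "constant_score" : { "query" : { "term" : { "myField" : "678" } }, "boost" : 1 } }
            ]
        }
    }
}

But that forces me to run the analysis myself and list the grams by hand, which is what the match query is supposed to do for me.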

Or maybe the analyzer could be changed so that position becomes an array? For example:

GET test/_analyze?analyzer=myGramAn&text=1234 5678
//-------------------
{
    "tokens" : [{
            "token" : "123",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "word",
            "position" : [1, 1]
        }, {
            "token" : "234",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "word",
            "position" : [1, 2]
        }, {
            "token" : "567",
            "start_offset" : 5,
            "end_offset" : 9,
            "type" : "word",
            "position" : [2, 1]
        }, {
            "token" : "678",
            "start_offset" : 5,
            "end_offset" : 9,
            "type" : "word",
            "position" : [2, 2]
        }
    ]
}

Or something like that...

Kamapcuc commented 9 years ago

Is there any progress on this problem?

clintongormley commented 9 years ago

No. I spoke to @mikemccand about this and there is no way to fix this other than what I have already suggested. You say you can't use the ngram tokenizer because you're using the snowball filter. Honestly, it doesn't make sense to combine those two: the ngram tokenizer makes the snowball filter redundant.

Regardless, any major change to analysis like this would have to go into Lucene first.

Kamapcuc commented 9 years ago

We use synonyms only with the expand token filter in the index analyzer and don't use them in the search analyzer. So for now I have just brute-forced it: https://github.com/Kamapcuc/lucene-solr/commit/33cb9491788980d47f88bb8df1bf5aed179cbecf But I think other people may run into this problem too.

clintongormley commented 8 years ago

Nothing to do here