Closed: Kamapcuc closed this issue 8 years ago
Hi @Kamapcuc
The problem with changing the current logic is that eg synonyms will no longer work as expected. For instance, let's say that you have "jumps" as a synonym for "leaps", using query time synonym expansion.
A query for "the quick fox jumps" would become "the quick fox (jumps|leaps)". If we required every token to match, this query would never match the indexed docs unless you had also used index-time expansion. Instead, we only require one token in each position.
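To make that concrete, the parsed query groups the position-stacked alternatives into one clause per position, roughly like this (an illustrative sketch, not actual Lucene output):

```
(the) (quick) (fox) (jumps OR leaps)
```

Each parenthesised group is satisfied by any one of its tokens, so the expanded query still matches documents that were indexed without synonym expansion.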
Token filters are not allowed to change positions or offsets, so the ngram token filter will always produce stacked tokens.
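For example, if you define an ngram token filter (here a hypothetical `myGramFilter` with `min_gram`/`max_gram` of 3) and run text through `_analyze`, both grams of "1234" come back stacked at the same position, roughly:

```json
GET test/_analyze?tokenizer=standard&filters=myGramFilter&text=1234

{
  "tokens": [
    { "token": "123", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "234", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 }
  ]
}
```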
However, you can use the ngram tokenizer to do exactly what you need. In your example, change the analyzer to the following:
```json
PUT test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "myGramTok": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "myGramAn": {
          "tokenizer": "myGramTok"
        }
      }
    }
  }
}
```
With this in place, your match query works as expected.
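For instance, a match query against a field analyzed with `myGramAn` would then honour `minimum_should_match` per gram (the field name `myField` is assumed from the rest of the thread):

```json
GET test/_search
{
  "query": {
    "match": {
      "myField": {
        "query": "1234 5678",
        "minimum_should_match": "100%"
      }
    }
  }
}
```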
Hi, Clinton!
Of course I understand that QueryBuilder:287 is there for synonyms (we use them, and they are awesome). But that doesn't change the fact that this situation with minimum_should_match is a bug.
It does not work as intended, without any excuse. There is not a word there about it not working with grams. (And it took me several hours to find out why our big, complicated, multi_match-based query sometimes returns very strange results.)
Your solution is not applicable, because we use the snowball token filter. Also, I can't remove n-grams entirely, because when I search for "conductor" I want to find "semiconductor" too.
Maybe there is some option to remove TF/IDF and replace it with a constant? For example, each term would add 1 to the score, so I could make the cutoff with min_score instead?
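What I have in mind could be sketched in the query DSL with constant_score wrappers (myField and the gram values are from my example below; this is just an illustration, not a proposed fix):

```json
GET test/_search
{
  "min_score": 4,
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "filter": { "term": { "myField": "123" } }, "boost": 1 } },
        { "constant_score": { "filter": { "term": { "myField": "234" } }, "boost": 1 } },
        { "constant_score": { "filter": { "term": { "myField": "567" } }, "boost": 1 } },
        { "constant_score": { "filter": { "term": { "myField": "678" } }, "boost": 1 } }
      ]
    }
  }
}
```

With each matching term contributing exactly 1, `min_score: 4` would keep only documents containing all four grams.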
Or maybe you could change how the analyzer works, so that position becomes an array? For example:
```json
GET test/_analyze?analyzer=myGramAn&text=1234 5678
```

```json
{
  "tokens": [
    {
      "token": "123",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": [1, 1]
    },
    {
      "token": "234",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": [1, 2]
    },
    {
      "token": "567",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": [2, 1]
    },
    {
      "token": "678",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": [2, 2]
    }
  ]
}
```
Or something like that...
Is there any progress in that problem?
No. I spoke to @mikemccand about this and there is no way to fix it other than what I have already suggested. You say you can't use the ngram tokenizer because you're using the snowball filter. Honestly, it doesn't make sense to combine those two - the ngram tokenizer makes the snowball filter redundant.
Regardless, any major change to analysis like this would have to go into Lucene first.
We use synonyms only with the expand token filter in the index analyzer, and don't use them in the search analyzer. So at the time I just brute-forced it: https://github.com/Kamapcuc/lucene-solr/commit/33cb9491788980d47f88bb8df1bf5aed179cbecf But I think other people may run into this problem too.
Nothing to do here
How to reproduce the bug:
Analyzer settings:
Mapping:
Documents:
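The original snippets are missing from this copy of the issue; a minimal setup consistent with the rest of the report could look like the following (the field name `myField` and the 3-gram token filter are inferred, not verbatim):

```json
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "myGramFilter": { "type": "ngram", "min_gram": 3, "max_gram": 3 }
      },
      "analyzer": {
        "myGramAn": {
          "tokenizer": "standard",
          "filter": [ "myGramFilter" ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "myField": { "type": "string", "analyzer": "myGramAn" }
      }
    }
  }
}

PUT test/doc/1
{ "myField": "1234" }

PUT test/doc/2
{ "myField": "1234 5678" }
```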
Now I want to find the documents which contain these four grams: "123", "234", "567", and "678". All of them occur only in the second document. So if I use a terms query, it returns the right answer:
But if I change it to a match query, the result is wrong:
But they should be the same:
The result is identical even if I change minimum_should_match to "100%", "2", "3", "-1", or "-2".
The root of this mistake lies in org.apache.lucene.util.QueryBuilder:287 (Lucene 4.10.1). The createFieldQuery method splits the terms by position increment into two should queries and wraps them in another should query carrying our minimum_should_match: (myField:123 myField:234)(myField:567 myField:678). That's why the results are identical for the other values of minimum_should_match. All other kinds of match queries are affected too.
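The effect of that grouping can be simulated outside Lucene. This sketch (my own illustration, not Lucene code) groups stacked tokens by position into outer should clauses, each satisfied by any one of its terms, which is why minimum_should_match stops distinguishing between individual grams:

```python
def build_groups(tokens):
    """Group (term, position) pairs: terms stacked at the same
    position become alternatives inside one should clause."""
    groups = {}
    for term, pos in tokens:
        groups.setdefault(pos, set()).add(term)
    return [groups[p] for p in sorted(groups)]

def matches(doc_terms, groups, minimum_should_match):
    # The outer bool counts how many position groups matched;
    # a group matches if the document contains ANY of its terms.
    matched = sum(1 for group in groups if group & doc_terms)
    return matched >= minimum_should_match

# "1234 5678" analysed with a 3-gram token filter: the grams of each
# word are stacked at the same position.
groups = build_groups([("123", 1), ("234", 1), ("567", 2), ("678", 2)])

# A document containing only two of the four grams, one per group,
# still satisfies minimum_should_match = 2 (i.e. "100%" of 2 clauses).
print(matches({"123", "567"}, groups, 2))  # True: the unwanted behaviour
print(matches({"123", "234"}, groups, 2))  # False: only one group matched
```

Because there are only two outer clauses, every minimum_should_match value from "-2" up to "100%" collapses to the same requirement, matching the behaviour described above.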