Closed ianribas closed 7 years ago
Hi @ianribas
Thanks for writing this up. @mikemccand is there any way we could improve multi-word synonym queries with the TermAutomatonQuery?
http://blog.mikemccandless.com/2014/08/a-new-proximity-query-for-lucene-using.html
It seems the problem is the same one stated in the "Limitations" section of http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html (linked from the post referenced above): neither the SynonymFilter nor the QueryBuilder correctly handles the situation where the synonym and the original term have different lengths in "AND" queries.
I think this issue in Lucene relates to this problem: LUCENE-3843.
Thanks for the reminder @ianribas, I had completely forgotten about this issue! As a start, maybe we can fix QueryBuilder to make use of positionLength where available and formulate the queries better. The logic in that thing is kind of scary and hairy, but I will take a look.
You're welcome, @rmuir. I looked around QueryBuilder a bit and the logic is really complex already. And I fear this situation will only add more special cases. Please let me know if I can be of any assistance.
You are correct @ianribas that the code is unapproachable. But we can't give up on it and just let it stagnate.
A lot of the complexity is because the TokenStream API (used to consume the analysis chain) is awkward to use here (additional state must be kept because it can only be consumed "forward-only").
We could consider another approach, such as converting the token stream to an automaton (we have a TokenStreamToAutomaton somewhere), and then consuming that. Maybe it would simplify all the code around this thing.
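To make the automaton idea concrete, here is a minimal Python sketch. It is a toy, term-level stand-in and not Lucene's actual TokenStreamToAutomaton; the function names and the token triples (term, position increment, position length) for the analyzed query "spiderman" are assumptions made up for illustration.

```python
from collections import defaultdict


def to_automaton(tokens):
    """Turn (term, position_increment, position_length) triples into
    a map of state -> [(term, destination_state), ...]."""
    transitions = defaultdict(list)
    position = -1
    for term, pos_inc, pos_len in tokens:
        position += pos_inc
        # A token leaves the state for its position and arrives
        # position_length states later.
        transitions[position].append((term, position + pos_len))
    return dict(transitions)


def enumerate_paths(transitions):
    """List every term sequence leading from state 0 to the final state."""
    final = max(dest for arcs in transitions.values() for _, dest in arcs)

    def walk(state):
        if state == final:
            yield []
            return
        for term, dest in transitions.get(state, []):
            for rest in walk(dest):
                yield [term] + rest

    return list(walk(0))


# Hypothetical analysis of the query "spiderman" with the rule
# "spider man, spiderman": "spiderman" spans two positions.
tokens = [("spiderman", 1, 2), ("spider", 0, 1), ("man", 1, 1)]
automaton = to_automaton(tokens)
print(enumerate_paths(automaton))  # [['spiderman'], ['spider', 'man']]
```

Once the stream is a graph, the consumer no longer has to track state while reading forward: each path is one complete alternative reading of the query text.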
I just want to think about all cases involved first: it's not just AND/OR, it also impacts the phrase operator. If you have this same situation in quotes, I think we should be using something like @mikemccand's TermAutomatonQuery. I am not sure what state it's in; Mike put it in the sandbox, I am sure for good reasons, and I'm not sure it supports slop yet.
TermAutomatonQuery is in sandbox just because it's so new and likely quite slow (missing optos like https://issues.apache.org/jira/browse/LUCENE-6396 that @rmuir just opened), but it can run arbitrary token-level automata (each transition is a token), including ones with cycles I think (which our token streams cannot produce).
But I think the original issue here is not about positional querying but rather about consuming the multiple tokens ("spider" and "man") created by the synonym filter and then not doing the right thing when the operator is "and", i.e. the query should effectively rewrite to +spider +man (except the impl in QueryBuilder.createFieldQuery seems to mix these cases)...
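As a rough illustration of the rewrite described here, the Python sketch below contrasts a position-by-position builder with a path-aware one. Both functions and the token triples are hypothetical and only approximate what QueryBuilder does; the output uses Lucene-style syntax where "+" marks a required clause.

```python
def flat_and_query(tokens):
    """Group terms by position only (ignoring position length) and
    require every position: roughly the problematic behavior."""
    by_position = {}
    position = -1
    for term, pos_inc, _pos_len in tokens:
        position += pos_inc
        by_position.setdefault(position, []).append(term)
    clauses = []
    for pos in sorted(by_position):
        terms = by_position[pos]
        body = terms[0] if len(terms) == 1 else "(" + " ".join(terms) + ")"
        clauses.append("+" + body)
    return " ".join(clauses)


def graph_and_query(paths):
    """Treat each path through the token graph as one alternative;
    a multi-term path requires all of its terms."""
    alternatives = []
    for path in paths:
        if len(path) == 1:
            alternatives.append(path[0])
        else:
            alternatives.append("(" + " ".join("+" + t for t in path) + ")")
    return " ".join(alternatives)


# Hypothetical analysis of "spiderman": (term, pos_inc, pos_len).
tokens = [("spiderman", 1, 2), ("spider", 0, 1), ("man", 1, 1)]
paths = [["spiderman"], ["spider", "man"]]

print(flat_and_query(tokens))  # +(spiderman spider) +man
print(graph_and_query(paths))  # spiderman (+spider +man)
```

The first form demands "man" even from documents that contain only "spiderman"; the second keeps +spider +man as one alternative next to the single-term synonym.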
except the impl in QueryBuilder.createFieldQuery seems to mix these cases
That is exactly what makes it difficult to fix the non-positional case (AND/OR) while still deferring the proximity case until we have better solutions. I think it's still possible, but not without adding another specialized case there (e.g. "multiple positions and lengths"). I will investigate this as an intermediate solution; maybe it's not so bad.
Here is some progress:
Even after these, there is more work for synonyms before it can do the right thing in all circumstances. And the position case needs some work on something like TermAutomatonQuery before things in double-quotes will work correctly.
Before changing the logic in QueryBuilder, we need to also consider CJK/shingle/n-grams/commongrams/kuromoji that are setting positionLength and make sure everything makes sense too.
Any update on this? I was using Solr 4.7.2 and upgraded to 5.5.0, then ran into the problem of combining multiword synonyms with operators (AND, OR, etc.). See http://stackoverflow.com/questions/35823263/complex-queries-using-multiword-synonyms-in-solr-lucene
I thought I should move to ES because this is where the party is going on. But I'm not sure that multiword synonyms and operators would work well here too after seeing this bug. Can someone update me on this?
This now works correctly if you switch from using the synonym token filter to the synonym_graph token filter (as a search_analyzer; to use synonym_graph as an index-time analyzer, you also need to add the flatten_graph token filter):
# delete old index if exists
DELETE /multiwordsyns?pretty
# create index with synonym analyzer and mapping
PUT /multiwordsyns?pretty
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "index_time": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase"
            ]
          },
          "synonym": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms": [
              "spider man, spiderman"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "index_time",
          "search_analyzer": "synonym"
        }
      }
    }
  }
}
# index the test documents
PUT /multiwordsyns/test/1?pretty
{"text": "the adventures of spiderman"}
PUT /multiwordsyns/test/2?pretty
{"text": "what hath man wrought?"}
PUT /multiwordsyns/test/3?pretty
{"text": "that spider is the size of a man"}
PUT /multiwordsyns/test/4?pretty&refresh=true
{"text": "spiders eat insects"}
# Works - finds #1 & #3
POST /multiwordsyns/test/_search?pretty
{"query": {"match": {"text": {"query": "spiderman", "operator": "and"}}}}
# Works - finds #1 & #3
POST /multiwordsyns/test/_search?pretty
{"query": {"match": {"text": {"query": "spider man", "operator": "and"}}}}
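For readers without a cluster at hand, the Python sketch below replays the four test documents above against two simplified matchers. It is only a model under stated assumptions: the tokenizer is a plain lowercase word split, the POSITIONS and PATHS structures are hand-written stand-ins for the analyzed query, and a multi-term synonym is treated as a conjunction of its terms.

```python
import re

DOCS = {
    1: "the adventures of spiderman",
    2: "what hath man wrought?",
    3: "that spider is the size of a man",
    4: "spiders eat insects",
}


def terms(text):
    """Lowercase word split, a simplified stand-in for the index analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_by_position(positions, doc_terms):
    """AND across positions, OR within a position (ignores position length)."""
    return all(any(t in doc_terms for t in alts) for alts in positions)


def match_by_path(paths, doc_terms):
    """OR across graph paths, AND within a path."""
    return any(all(t in doc_terms for t in path) for path in paths)


# Hand-written stand-ins for the analyzed query "spiderman".
POSITIONS = [["spiderman", "spider"], ["man"]]
PATHS = [["spiderman"], ["spider", "man"]]

by_position = [i for i, text in DOCS.items()
               if match_by_position(POSITIONS, terms(text))]
by_path = [i for i, text in DOCS.items()
           if match_by_path(PATHS, terms(text))]

print(by_position)  # [3]
print(by_path)      # [1, 3]
```

In this model the path-aware matcher returns #1 and #3, in line with the results noted in the requests above, while the position-only matcher drops #1.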
This issue is very similar to what is documented in the entry about multiword synonyms and phrase queries. What is different is that the same kind of problem, when the synonyms are of different lengths (in number of terms), occurs when using the standard match query with the operator flag set to and. In this case too, some documents may unexpectedly not match.
Here is an example that reproduces the behavior:
Also available as a gist: https://gist.github.com/ianribas/f76d20c21bb9f5c0df2f
If the synonyms are applied at index time, the example above works. This can be used as a workaround, but it is a choice that has other impacts, as described in the documentation.
It took us a while to identify this problem, so I thought it was important to at least write it down so it could maybe help others.