Multiword query time synonyms and match queries with the "and" operator may not work

ianribas commented 9 years ago

This issue is very similar to the one documented in the entry about multiword synonyms and phrase queries. The difference is that the same kind of problem, which arises when the synonyms have different lengths (in number of terms), also occurs with the standard match query when the operator flag is set to and. In this case too, some documents may unexpectedly fail to match.

Here is an example that reproduces the behavior:

# delete old index if exists
curl -XDELETE 'http://localhost:9200/multiwordsyns?pretty'

# create index with synonym analyzer and mapping
curl -XPUT 'http://localhost:9200/multiwordsyns?pretty' -d '{
    "settings" : {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "index": {
            "analysis": {
                "analyzer": {
                    "index_time": {
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase"]
                    },
                    "synonym": {
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "stop", "synonym"]
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms": [
                            "spider man, spiderman"
                        ]
                    }
                }
            }
        }
    },
    "mappings": {
        "test": {
            "properties": {
                "text": {"type": "string", "index_analyzer": "index_time", "search_analyzer": "synonym"}
            }
        }
    }
}'

# index the test documents
curl -XPUT 'http://localhost:9200/multiwordsyns/test/1?pretty' -d '{"text": "the adventures of spiderman"}'
curl -XPUT 'http://localhost:9200/multiwordsyns/test/2?pretty' -d '{"text": "what hath man wrought?"}'
curl -XPUT 'http://localhost:9200/multiwordsyns/test/3?pretty' -d '{"text": "that spider is the size of a man"}'
curl -XPUT 'http://localhost:9200/multiwordsyns/test/4?pretty&refresh=true' -d '{"text": "spiders eat insects"}'

# WRONG! finds only #1, should find #1 & #3
curl -XPOST 'http://localhost:9200/multiwordsyns/test/_search?pretty' -d '{"query": {"match": {"text": {"query": "spiderman", "operator": "and"}}}}'

# Also WRONG! finds only #1, should find #1 & #3
curl -XPOST 'http://localhost:9200/multiwordsyns/test/_search?pretty' -d '{"query": {"match": {"text": {"query": "spider man", "operator": "and"}}}}'

Also available as a gist: https://gist.github.com/ianribas/f76d20c21bb9f5c0df2f

If the synonyms are applied at index time, the example above works. This can be used as a workaround, but it is a choice that has other impacts, as described in the documentation.

It took us a while to identify this problem, so I thought it was important to at least write it down so it could maybe help others.

clintongormley commented 9 years ago

Hi @ianribas

Thanks for writing this up. @mikemccand is there any way we could improve multi-word synonym queries with the TermAutomatonQuery?

http://blog.mikemccandless.com/2014/08/a-new-proximity-query-for-lucene-using.html

ianribas commented 9 years ago

It seems the problem is the same one described in the "Limitations" section of http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html (linked from the post referenced above): neither the SynonymFilter nor the QueryBuilder correctly handles the situation where the synonym and the original term have different lengths in "AND" queries.
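One way to look at what the search-time analyzer emits for a query string is the _analyze API. The sketch below assumes the multiwordsyns index from the reproduction above, a node on localhost:9200, and a version of Elasticsearch whose _analyze endpoint accepts a JSON body with the explain flag (older releases took these as URL parameters instead). The helper names analyze_body and show_tokens are hypothetical.

```shell
# Assumption: node address; override ES_URL if yours differs.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the request body for the index's "synonym" analyzer.
# "explain": true asks for per-token attributes, which is where position
# information (and positionLength, on versions that expose it) shows up.
# Note: the text is inserted as-is, so it must not contain double quotes.
analyze_body() {
  printf '{"analyzer": "synonym", "text": "%s", "explain": true}' "$1"
}

# Hypothetical helper: print the tokens produced for one query string.
show_tokens() {
  curl -s -XPOST "$ES_URL/multiwordsyns/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1")"
}
```

Calling show_tokens "spiderman" and show_tokens "spider man" against a running node lets you compare the positions assigned to spiderman, spider and man in each case.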

I think this issue in Lucene relates to this problem: LUCENE-3843.

rmuir commented 9 years ago

Thanks for the reminder @ianribas, I had completely forgotten about this issue! As a start, maybe we can fix QueryBuilder to make use of positionLength where available and formulate the queries better. The logic in that thing is kind of scary and hairy, but I will take a look.

ianribas commented 9 years ago

You're welcome, @rmuir. I looked around QueryBuilder a bit and the logic is really complex already. And I fear this situation will only add more special cases. Please let me know if I can be of any assistance.

rmuir commented 9 years ago

You are correct @ianribas that the code is unapproachable. But we can't give up on it and just let it stagnate.

A lot of the complexity is because the tokenstream API (used to consume the analysis chain) is awkward to use here (additional state must be kept because it can only be consumed "forward-only").

We could consider another approach, such as converting the tokenstream to an automaton (we have a TokenstreamToAutomaton somewhere), and then consuming that. Maybe it would simplify all the code around this thing.

I just want to think about all cases involved first: it's not just AND/OR, this also impacts the phrase operator. If you have this same situation in quotes, I think we should be using something like @mikemccand's TermAutomatonQuery. I am not sure what state it's in; Mike put it in the sandbox, I am sure for good reasons, and I'm not sure it supports slop yet.

mikemccand commented 9 years ago

TermAutomatonQuery is in sandbox just because it's so new and likely quite slow (missing optos like https://issues.apache.org/jira/browse/LUCENE-6396 that @rmuir just opened), but it can run arbitrary token-level automata (each transition is a token), including ones with cycles I think (which our token streams cannot produce).

But I think the original issue here is not about positional querying but rather about consuming the multiple tokens ("spider" and "man") created by the synonym filter yet not doing the right thing when the operator is "and", i.e. the query should effectively rewrite to +spider +man (except the impl in QueryBuilder.createFieldQuery seems to mix these cases)...
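To make the intended rewrite concrete, here is a hand-written bool query that spells out one reading of it: match either the single term spiderman, or both spider and man. This is only a sketch of the target semantics, not what QueryBuilder generates. It assumes the multiwordsyns index and test documents from the reproduction above; EXPANDED_QUERY and search_expanded are hypothetical names.

```shell
# Assumption: node address; override ES_URL if yours differs.
ES_URL="${ES_URL:-http://localhost:9200}"

# spiderman OR (+spider +man), written by hand with term queries.
# Term queries are not analyzed, so the terms are given in the lowercase
# form that the index-time analyzer stores.
EXPANDED_QUERY='{
  "query": {
    "bool": {
      "should": [
        {"term": {"text": "spiderman"}},
        {"bool": {"must": [
          {"term": {"text": "spider"}},
          {"term": {"text": "man"}}
        ]}}
      ]
    }
  }
}'

# Hypothetical helper: run the hand-expanded query against the test index.
search_expanded() {
  curl -s -XPOST "$ES_URL/multiwordsyns/test/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$EXPANDED_QUERY"
}
```

Running search_expanded against the four test documents should select the one containing spiderman and the one containing both spider and man, which is the result the match query with "operator": "and" was expected to give.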

rmuir commented 9 years ago

> except the impl in QueryBuilder.createFieldQuery seems to mix these cases

That is exactly what makes it difficult to fix the non-positional case (AND/OR) and still defer the proximity case until we have better solutions. I think it's still possible, but not without adding another specialized case there (e.g. "multiple positions and lengths"). I will investigate this as an intermediate solution; maybe it's not so bad.
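For anyone following along, the validate API with explain=true is one way to see which Lucene query a given match query is turned into, and so to check how the AND case is currently built. This is a sketch that assumes the multiwordsyns index from the reproduction above and a node on localhost:9200; validate_body and explain_rewrite are hypothetical helper names.

```shell
# Assumption: node address; override ES_URL if yours differs.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the same match query used in the reproduction, with operator "and".
# Note: the text is inserted as-is, so it must not contain double quotes.
validate_body() {
  printf '{"query": {"match": {"text": {"query": "%s", "operator": "and"}}}}' "$1"
}

# Hypothetical helper: ask Elasticsearch to explain how the query is built.
explain_rewrite() {
  curl -s -XPOST "$ES_URL/multiwordsyns/test/_validate/query?explain=true&pretty" \
    -H 'Content-Type: application/json' \
    -d "$(validate_body "$1")"
}
```

Comparing the explanation for explain_rewrite "spiderman" with the one for explain_rewrite "spider man" shows whether the synonym tokens end up as alternatives at one position or as separate required clauses.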

rmuir commented 9 years ago

Here is some progress:

https://issues.apache.org/jira/browse/LUCENE-6400 SolrSynonymParser (it's the syntax used here by ES) doesn't really construct the map correctly. Because of this, positionLengths are not even available today in your situation.
https://issues.apache.org/jira/browse/LUCENE-6401 Refactor the big method :)

Even after these, there is more work for synonyms before it can do the right thing in all circumstances. And the position case needs some work on something like TermAutomatonQuery before things in double-quotes will work correctly.

Before changing the logic in QueryBuilder, we need to also consider CJK/shingle/n-grams/commongrams/kuromoji that are setting positionLength and make sure everything makes sense too.

atuljangra commented 8 years ago

Any update on this? I was using Solr 4.7.2 and upgraded to 5.5.0. I ran into the problem with multiword synonyms and operators (AND, OR, etc.). See http://stackoverflow.com/questions/35823263/complex-queries-using-multiword-synonyms-in-solr-lucene

I thought I should move to ES because this is where the party is going on. But I'm not sure that multiword synonyms and operators would work well here too after seeing this bug. Can someone update me on this?

clintongormley commented 7 years ago

This now works correctly if you switch from the synonym token filter to the synonym_graph token filter in the search_analyzer. (To use synonym_graph in an index-time analyzer, you also need to add the flatten_graph token filter.) Here is the original reproduction, updated accordingly:

# delete old index if exists
DELETE /multiwordsyns?pretty

# create index with synonym analyzer and mapping
PUT /multiwordsyns?pretty
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "index_time": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase"
            ]
          },
          "synonym": {
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym_graph",
            "synonyms": [
              "spider man, spiderman"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "index_time",
          "search_analyzer": "synonym"
        }
      }
    }
  }
}

# index the test documents
PUT /multiwordsyns/test/1?pretty
{"text": "the adventures of spiderman"}

PUT /multiwordsyns/test/2?pretty
{"text": "what hath man wrought?"}

PUT /multiwordsyns/test/3?pretty
{"text": "that spider is the size of a man"}

PUT /multiwordsyns/test/4?pretty&refresh=true
{"text": "spiders eat insects"}

# Works - finds #1 & #3
POST /multiwordsyns/test/_search?pretty
{"query": {"match": {"text": {"query": "spiderman", "operator": "and"}}}}

# Works - finds #1 & #3
POST /multiwordsyns/test/_search?pretty
{"query": {"match": {"text": {"query": "spider man", "operator": "and"}}}}
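For the index-time case mentioned above, the analyzer would need flatten_graph placed after the synonym_graph filter. The following is a minimal sketch of such settings, not a tested configuration: the index name passed to create_index, the analyzer name index_time_synonyms, and the variable INDEX_SETTINGS are all hypothetical, and a mapping that assigns the analyzer to a field still has to be added.

```shell
# Assumption: node address; override ES_URL if yours differs.
ES_URL="${ES_URL:-http://localhost:9200}"

# Index-time analyzer: the synonym_graph filter emits a token graph, and
# flatten_graph flattens it into a form that can be indexed.
INDEX_SETTINGS='{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_time_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym", "flatten_graph"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym_graph",
          "synonyms": ["spider man, spiderman"]
        }
      }
    }
  }
}'

# Hypothetical helper: create an index with the settings above.
create_index() {
  curl -s -XPUT "$ES_URL/$1?pretty" \
    -H 'Content-Type: application/json' \
    -d "$INDEX_SETTINGS"
}
```

For example, create_index multiwordsyns_indextime would create a separate index to experiment with, leaving the search-time setup shown above untouched.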

elastic / elasticsearch

Multiword query time synonyms and match queries with the "and" operator may not work #10394