apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

Deprecate PositionFilter [LUCENE-4981] #6045

Closed asfimport closed 11 years ago

asfimport commented 11 years ago

According to the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory), PositionFilter is mainly useful to make query parsers generate boolean queries instead of phrase queries although this problem can be solved at query parsing level instead of analysis level (eg. using QueryParser.setAutoGeneratePhraseQueries).

So given that PositionFilter corrupts token graphs (see TestRandomChains), I propose to deprecate it.


Migrated from LUCENE-4981 by Adrien Grand (@jpountz), resolved May 18 2013 Attachments: LUCENE-4981.patch (versions: 2)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is the patch for 4.x. The patch for trunk is simpler as PositionFilter and PositionFilterFactory would simply be removed.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Adrien, can you hold off committing for a little bit? I'm not sure if QueryParser.setAutoGeneratePhraseQueries is sufficient for all cases that the PositionFilter hack addresses - I want to do some investigation.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Sure I can wait. (Even when committed, the old behavior will still be available by using luceneMatchVersion=4.3).

I would like to start marking all our broken components (the offenders in TestRandomChains) as deprecated so that people start thinking about ways to solve their problems without them, stop getting highlighting bugs and can eventually smoothly upgrade to 5.0 when we release it. I already started deprecating/fixing some tokenizers / token filters for 4.4 (LUCENE-4955 and LUCENE-4963) and would like to get as many of them fixed as possible for the next release.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Thanks for working on fixing the broken stuff.

In addition to use cases, I want to investigate the exact nature of the brokenness PositionFilter introduces - maybe it's fixable? I'll re-enable it in TestRandomChains and iterate until it breaks.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure its fixable: by definition it corrupts the structure because you lose all posincs. so synonyms no longer become synonyms, holes disappear, or whatever. and this doesnt even factor in posLength...

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

The comment in TestRandomChains says:

// TODO: corrumpts graphs (offset consistency check):
PositionFilter.class,

which is what made me wonder what about the nature of brokenness: why are offsets a problem?

I agree, Robert, PositionFilter corrupts by design. And if we do end up keeping it, position length should be addressed (it's not now), maybe by always setting it to 1.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

why are offsets a problem?

There are invariants that need to be maintained by token filters: all tokens that start at the same position must have the same start offset and all tokens that end at the same position (start position + position length) must have the same end offset (see ValidatingFilter). By arbitrarily changing position increments, PositionFilter breaks these invariants.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

which is what made me wonder what about the nature of brokenness: why are offsets a problem?

I think Adrien describes it correctly: afaik it doesn't do anything super-evil like make start offsets go backwards or anything, but it breaks those invariants Adrien describes which can cause a follow-on-filter (e.g. shingle) to cause further craziness, e.g. things going backwards or endOffset < startOffset or other problems.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Thanks for the pointer Adrien, I'll take a look at ValidatingFilter.

It might be possible, by creating new positions, to enable offset consistency in PositionFilter. Not sure it's worth the effort though.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The Validatingfilter should be the same logic in BaseTokenStreamTestCase:196

I think its in a separate filter because then its applied at each "stage" of the analysis in TestRandomChains so if there is a bug in a complex analysis chain we know the culprit.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Steve, may I commit this patch?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Adrien, 24 more hours, please?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Oh sure, I didn't want to rush you. Take the time you need.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Adrien,

I looked http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory, and where it originally came from (http://markmail.org/message/g4habmbyeuckmix6 and LUCENE-1380), and I don't think existing query parser functionality, including QueryParser.setAutoGeneratePhraseQueries, will cover the use case it was created to handle.

That use case is roughly: given an indexed non-tokenized string field (e.g. "a b"), and a multi-word query against that field, create disjunction query of all possible word n-grams, where 0<n<N and N is as large as the expected longest query. E.g. a query "a b c" would result in "'a' OR 'a b' OR 'a b c' OR 'b' OR 'b c' OR 'c'", and would match a doc with field value "a b".

Michael Semb Wever, the guy who started the thread and created the issue, was able to handle this use case by stringing together:

  1. Quoting the query, to allow the configured analyzer to see all of the terms instead of one-at-a-time
  2. ShingleFilter, to create the n-grams
  3. The new PositionFilter, to place all terms at the same position
  4. QueryParser's synonym handling functionality, which produces a MultiPhraseQuery, which when given multiple terms at the same single position, creates a BooleanQuery with one SHOULD TermQuery for each term.

Without PositionFilter, is there some way to achieve the same goal?

I don't think we should get rid of PositionFilter unless we have an alternate way to handle the (IMHO legitimate) use case it was originally designed to cover.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Without PositionFilter, is there some way to achieve the same goal?

My understanding is that this use-case would like the query parser to interpret phrase queries as sub-phrase queries. But instead of creating a specific query parser in order to process phrase queries differently (by overriding newFieldQuery for example), it tries to hack the token stream so that the default query parser generates the expected query. So I don't really think this is a use-case for PositionFilter?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

So I don't really think this is a use-case for PositionFilter?

I agree, subclassing the QP and overriding newFieldQuery and getFieldQuery should be sufficient to handle this use case. Current PositionFilter users will have to maintain their own code outside of Lucene and Solr's codebase, rather than having a configuration-only solution.

I think the @deprecated annotation on PositionFilter in branch_4x should be augmented to help people find this alternative. Similarly, in the backcompat section of trunk CHANGES.txt, and/or MIGRATE.txt, this issue should be mentioned.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for your feedback Steve. I updated the patch to say that the problems that were solved by PositionFilter should be solved at query parsing level, does it look better?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

+1, thanks Adrien!