Add a FixedShingleFilter [LUCENE-8202]

asfimport commented 6 years ago

In #4549 I tried to make a ShingleGraphFilter that could accept and emit arbitrary graphs, while duplicating all the functionality of the existing ShingleFilter. This ends up being extremely hairy, and doesn't play well with query parsers.

I'd like to step back and try and create a simpler shingle filter that can be used for index-time phrase tokenization only. It will have a single fixed shingle size, can deal with single-token synonyms, and won't emit unigrams.

Migrated from LUCENE-8202 by Alan Woodward (@romseygeek), resolved Mar 25 2018 Attachments: LUCENE-8202.patch (versions: 3), LUCENE-8202-fixes.patch

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I suppose this is for fields that one might use in Solr "pf2" "pf3" etc ? Can ShingleGraphFilter do what this does too, albeit slower and with greater chance of bugs? It appears so. Maybe we only need one Factory, and the Factory can produce the Filter most appropriate based on the configuration?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Alan, I agree this issue looks easier to address if we don't have to support all the features that ShingleFilter accumulated over the years and focus on index-time phrase tokenization. Some comments about the patch: - Can you make this filter experimental, ideally it would get merged with ShingleFilter eventually? - Can you clarify in the sentence about stacked shingles that this is how this filter differs from ShingleFilter. I think it's important for users to realize when it makes sense to use this one.

Let's not use upper case for variables that are not static? (GAP_TOKEN and END_TOKEN)
Let's make tokenPool an ArrayDeque instead for efficiency?
Token doesn't need the type attribute, does it?
advanceStack runs in quadratic time with shingleSize, is it something we can fix? If not then let's put an upper bound on the value that shingleSize may take (which might be a good move regardless).
Token.reset should use AttributeSource.copyTo rather than captureState+restoreState? This would save a lot of cloning.
If it only cares about index-time I'm confused why it tries to work on graph inputs, I think it's safe to assume and document that the FlattenGraphFilter must be applied prior to the FixedShingleFilter? Said otherwise, say you have an existing analyzer. I think we should have the goal that wrapping this analyzer (AnalyzerWrapper) with a FixedShingleFilter should emit exactly all the exact phrases of length shingleSize that you could search if you indexed with the wrapped analyzer directly? For instance if I look at the testMultiLengthPathShingles test, it indexes the shingle "the usa is". But this is not a phrase that you can search if you create the same analysis chain without the shingle filter in the end ("the" and "usa" are at position 0 and 1, but "is" is at position 6).

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Updated patch, which I think should address all of Adrien's concerns.

advanceStack runs in quadratic time with shingleSize

I don't think it does? The inner loop just continues the outer loop, so it should be linear.

Maybe we only need one Factory

I'm really seeing these as two separate filters, one for query time that handles graphs, and one for index time that assumes a linear stream (or sausage :)), so I think a single factory would confuse things. But we can discuss that on the other issue, I think this is a simpler thing to get in first.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The inner loop just continues the outer loop, so it should be linear.

I had read too quickly, thanks for clarifying.

Your reasoning about the dedicated factory also makes sense to me.

The latest patch looks good to me, I'd just like testing to be done on canned token streams rather than using other analysis components like the whitespace tokenizer or the synonym token filter. Other than that +1 to push.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Updated patch using CannedTokenStream for tests. Will commit shortly.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1

asfimport commented 6 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

+1, thanks Alan.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 230a77ce38ebe6294a06aebf23d85b68223b6ec2 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=230a77c

LUCENE-8202: Add FixedShingleFilter

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit fac84c01c84b3693a8c1251ae77f349c38497e06 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fac84c0

LUCENE-8202: Add FixedShingleFilter

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit bd6cf168e0e129fa22545a7f614b2b146bd5f202 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd6cf16

LUCENE-8202: Add checks for shingle size

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 2c4b78c43fe2e30ef748af34a1daa174d66e29cc in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c4b78c

LUCENE-8202: Add checks for shingle size

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Reopening to address a couple of test failures

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

TestRandomChains has found two issues:

positionLength should be 1, rather than the shingle length. We don't have any intermediary tokens, only shingles, so we're not building graphs. TRC found this by feeding the output into FlattenGraphFilter, which then complained.
we need somehow limit either the length of the shingle, or the number of stacked positions we iterate through, as we can otherwise get a combinatorial explosion of terms. TRC found this by feeding long strings into a decompounding filter, and then building shingles of length 11. The decompounding filter was producing up to 50 tokens in the same position, which lead to 50^11 shingles being generated, resulting in OOM. I'm not sure of the best way of dealing with this one though - we could just limit shingle length to a maximum of 3 or 4, but that seems like too harsh a restriction for this. The other possibility would be to have a (configurable) maximum number of shingles emitted at a single position, and throw IllegalStateException if this is hit.

asfimport commented 6 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

+1 to set position length to 1, this is a fixed size shingle filter so there's no additional information in this attribute. Regarding the explosion of the number of terms can you track the total number of tokens that need to produce a shingle at the next position and ignore new tokens with posIncr=0 if the number is too high (1000 ?) ?

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I don't like to silently drop tokens though. I think throwing an exception is the way to go, with a clear message saying 'either prune your synonym tree, reduce the shingle size, or up the soft limit'

asfimport commented 6 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

sure +1 for the exception, I don't think that this limit should be configurable though, 1000 seems more than enough to handle normal cases ?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The decompounding filter was producing up to 50 tokens in the same position, which lead to 50^11 shingles being generated, resulting in OOM.

I agree that 50 tokens per position and a shingle length of 11 would create 50^11 shingles, but the filter would only keep 50*11 tokens in memory and generate shingles on the fly. So I would expect it to be slow, but not to use lots of memory? So maybe we have a bug?

I'm not sure of the best way of dealing with this one though - we could just limit shingle length to a maximum of 3 or 4, but that seems like too harsh a restriction for this. The other possibility would be to have a (configurable) maximum number of shingles emitted at a single position, and throw IllegalStateException if this is hit.

I would add both. I don't think restricting the shingle size would be too harsh, I don't remember of someone using a max shingle size that was greater than 3? Also limits are always easier to increase than decrease.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

So I would expect it to be slow, but not to use lots of memory?

BaseTokenStreamTestCase.checkAnalysisConsistency() stores all incoming tokens in a list so that it can then compare them against a second run, so it's a test issue rather than anything else.

I would add both

+1 - I'll add a max shingle size of four and a max number of stacks of 1000.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've attached the bugfix patch for reference.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 7e4358046eaf9c887ef3037f9ba3460b6bae5f06 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7e43580

LUCENE-8202: Fix positionlength for FixedShingleFilter and add limits to shingle size and count

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit bbf8306615aa87d12f52113587140fb792d93cd5 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bbf8306

LUCENE-8202: Fix positionlength for FixedShingleFilter and add limits to shingle size and count

apache / lucene

Add a FixedShingleFilter [LUCENE-8202] #9249