Closed asfimport closed 6 years ago
David Smiley (@dsmiley) (migrated from JIRA)
I suppose this is for fields that one might use in Solr "pf2" "pf3" etc ? Can ShingleGraphFilter do what this does too, albeit slower and with greater chance of bugs? It appears so. Maybe we only need one Factory, and the Factory can produce the Filter most appropriate based on the configuration?
Adrien Grand (@jpountz) (migrated from JIRA)
Thanks Alan, I agree this issue looks easier to address if we don't have to support all the features that ShingleFilter accumulated over the years and focus on index-time phrase tokenization. Some comments about the patch: - Can you make this filter experimental, ideally it would get merged with ShingleFilter eventually? - Can you clarify in the sentence about stacked shingles that this is how this filter differs from ShingleFilter. I think it's important for users to realize when it makes sense to use this one.
shingleSize
that you could search if you indexed with the wrapped analyzer directly? For instance if I look at the testMultiLengthPathShingles
test, it indexes the shingle "the usa is". But this is not a phrase that you can search if you create the same analysis chain without the shingle filter in the end ("the" and "usa" are at position 0 and 1, but "is" is at position 6).Alan Woodward (@romseygeek) (migrated from JIRA)
Updated patch, which I think should address all of Adrien's concerns.
advanceStack runs in quadratic time with shingleSize
I don't think it does? The inner loop just continues the outer loop, so it should be linear.
Maybe we only need one Factory
I'm really seeing these as two separate filters, one for query time that handles graphs, and one for index time that assumes a linear stream (or sausage :)), so I think a single factory would confuse things. But we can discuss that on the other issue, I think this is a simpler thing to get in first.
Adrien Grand (@jpountz) (migrated from JIRA)
The inner loop just continues the outer loop, so it should be linear.
I had read too quickly, thanks for clarifying.
Your reasoning about the dedicated factory also makes sense to me.
The latest patch looks good to me, I'd just like testing to be done on canned token streams rather than using other analysis components like the whitespace tokenizer or the synonym token filter. Other than that +1 to push.
Alan Woodward (@romseygeek) (migrated from JIRA)
Updated patch using CannedTokenStream for tests. Will commit shortly.
Adrien Grand (@jpountz) (migrated from JIRA)
+1
Jim Ferenczi (@jimczi) (migrated from JIRA)
+1, thanks Alan.
ASF subversion and git services (migrated from JIRA)
Commit 230a77ce38ebe6294a06aebf23d85b68223b6ec2 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=230a77c
LUCENE-8202: Add FixedShingleFilter
ASF subversion and git services (migrated from JIRA)
Commit fac84c01c84b3693a8c1251ae77f349c38497e06 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fac84c0
LUCENE-8202: Add FixedShingleFilter
ASF subversion and git services (migrated from JIRA)
Commit bd6cf168e0e129fa22545a7f614b2b146bd5f202 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd6cf16
LUCENE-8202: Add checks for shingle size
ASF subversion and git services (migrated from JIRA)
Commit 2c4b78c43fe2e30ef748af34a1daa174d66e29cc in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c4b78c
LUCENE-8202: Add checks for shingle size
Alan Woodward (@romseygeek) (migrated from JIRA)
Reopening to address a couple of test failures
Alan Woodward (@romseygeek) (migrated from JIRA)
TestRandomChains has found two issues:
Jim Ferenczi (@jimczi) (migrated from JIRA)
+1 to set position length to 1, this is a fixed size shingle filter so there's no additional information in this attribute. Regarding the explosion of the number of terms can you track the total number of tokens that need to produce a shingle at the next position and ignore new tokens with posIncr=0 if the number is too high (1000 ?) ?
Alan Woodward (@romseygeek) (migrated from JIRA)
I don't like to silently drop tokens though. I think throwing an exception is the way to go, with a clear message saying 'either prune your synonym tree, reduce the shingle size, or up the soft limit'
Jim Ferenczi (@jimczi) (migrated from JIRA)
sure +1 for the exception, I don't think that this limit should be configurable though, 1000 seems more than enough to handle normal cases ?
Adrien Grand (@jpountz) (migrated from JIRA)
The decompounding filter was producing up to 50 tokens in the same position, which lead to 50^11 shingles being generated, resulting in OOM.
I agree that 50 tokens per position and a shingle length of 11 would create 50^11 shingles, but the filter would only keep 50*11 tokens in memory and generate shingles on the fly. So I would expect it to be slow, but not to use lots of memory? So maybe we have a bug?
I'm not sure of the best way of dealing with this one though - we could just limit shingle length to a maximum of 3 or 4, but that seems like too harsh a restriction for this. The other possibility would be to have a (configurable) maximum number of shingles emitted at a single position, and throw IllegalStateException if this is hit.
I would add both. I don't think restricting the shingle size would be too harsh, I don't remember of someone using a max shingle size that was greater than 3? Also limits are always easier to increase than decrease.
Alan Woodward (@romseygeek) (migrated from JIRA)
So I would expect it to be slow, but not to use lots of memory?
BaseTokenStreamTestCase.checkAnalysisConsistency() stores all incoming tokens in a list so that it can then compare them against a second run, so it's a test issue rather than anything else.
I would add both
+1 - I'll add a max shingle size of four and a max number of stacks of 1000.
Alan Woodward (@romseygeek) (migrated from JIRA)
I've attached the bugfix patch for reference.
ASF subversion and git services (migrated from JIRA)
Commit 7e4358046eaf9c887ef3037f9ba3460b6bae5f06 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7e43580
LUCENE-8202: Fix positionlength for FixedShingleFilter and add limits to shingle size and count
ASF subversion and git services (migrated from JIRA)
Commit bbf8306615aa87d12f52113587140fb792d93cd5 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bbf8306
LUCENE-8202: Fix positionlength for FixedShingleFilter and add limits to shingle size and count
In #4549 I tried to make a ShingleGraphFilter that could accept and emit arbitrary graphs, while duplicating all the functionality of the existing ShingleFilter. This ends up being extremely hairy, and doesn't play well with query parsers.
I'd like to step back and try and create a simpler shingle filter that can be used for index-time phrase tokenization only. It will have a single fixed shingle size, can deal with single-token synonyms, and won't emit unigrams.
Migrated from LUCENE-8202 by Alan Woodward (@romseygeek), resolved Mar 25 2018 Attachments: LUCENE-8202.patch (versions: 3), LUCENE-8202-fixes.patch