Open asfimport opened 13 years ago
Steven Rowe (@sarowe) (migrated from JIRA)
From the ShingleFilter javadocs: "This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0."
Steven Rowe (@sarowe) (migrated from JIRA)
See ShingleMatrixFilter for a shingle implementation that appears to support synonyms. But it's deprecated and slated for removal in Lucene 4.0. So caveat emptor.
Cameron (migrated from JIRA)
"It does not handle a position increment of 0."
...I had interpreted that to mean it would just 'ignore' or not try to do anything with tokens where posIncr=0. I guess that is an incorrect assumption?
Steven Rowe (@sarowe) (migrated from JIRA)
That sentence was intended to warn users that ShingleFilter behavior was undefined over streams with position increment of zero.
Michael McCandless (@mikemccand) (migrated from JIRA)
I think this is important, now that we have "graph" analyzers (like Kuromoji).
So ShingleFilter should pay attention to posInc as well as posLength...
Steven Rowe (@sarowe) (migrated from JIRA)
Adrien Grand (@jpountz) (migrated from JIRA)
I'm not going to work on this in the short term but I just wrote a test case in case someone else has interest in fixing this issue:
public void testSynonyms1() throws IOException {
Token t1 = new Token("small", 0, 5);
Token t2 = new Token("little", 0, 5);
t2.setPositionIncrement(0);
Token t3 = new Token("thing", 6, 11);
TokenStream tk = new CannedTokenStream(t1, t2, t3);
ShingleFilter f = new ShingleFilter(tk, 2, 2);
f.setOutputUnigrams(false);
assertTokenStreamContents(f,
new String[] { "small thing", "little thing" },
new int[] {0, 0},
new int[] {11, 11},
new int[] {1, 0});
}
public void testSynonyms2() throws IOException {
Token t1 = new Token("small", 0, 5);
Token t2 = new Token("thing", 6, 11);
Token t3 = new Token("object", 6, 11);
t3.setPositionIncrement(0);
TokenStream tk = new CannedTokenStream(t1, t2, t3);
ShingleFilter f = new ShingleFilter(tk, 2, 2);
f.setOutputUnigrams(false);
assertTokenStreamContents(f,
new String[] { "small thing", "small object" },
new int[] {0, 0},
new int[] {11, 11},
new int[] {1, 0});
}
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
@jpountz Hi Adrien! I wonder what would be the best approach to handle positionIncrement=0?
I was thinking that in ShingleFilter:getNextToken we can do something like this:
if (input.incrementToken()) {
while (posIncrAtt.getPositionIncrement() == 0) { // we may have multiple synonyms
if (input.incrementToken()) { // go to next token
....
// store synonyms tokens and following tokens somewhere and create a new input TokenStream from them?
I guess I am wondering if we have any other reference code that recreates a tokenStream from synonym tokens?
Adrien Grand (@jpountz) (migrated from JIRA)
@mayya-sharipova I think we need to do something like this indeed. We might also need `getNextTokenI don't think you should try to recreate a token stream from the synonyms, however it is true that you need to record their state somewhere. There are multiple ways to do this, for instance ShingleFilter uses \
cloneAttributes, but there is also
captureState}}/\
restoreState. You can look at \
EdgeNGramTokenFilter` for an example of how to use them.
I haven't looked deeply at how the token filter currently works, but I wouldn't be surprised that significant refactorings are required to make things work, so feel free to reorganize the code if it makes this problem any easier to fix!
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
@jpountz thank so much for suggestions.
@romseygeek is going to work on this issue. I will study his solution.
Alan Woodward (@romseygeek) (migrated from JIRA)
Here's a patch with a try at implementing a ShingleGraphFilter. Still requires javadocs, more testing, randomized testing, adding into RandomChains, etc, but it would be good to get some more eyes on it.
It has a slightly different output to ShingleFilter on non-graph tokenstreams, in that it emits shingles longest first, due to the way I went about it. It might not be too difficult to change that though.
To support backtracking in the TokenStream, the underlying stream is read (lazily) into a linked list, with Token objects reused once the filter has moved past them. The current shingle is just an array of references into the list. Shingles are built using the nextTokenInGraph(Token) method, which will use a token's length attribute to move through the linked list and find its successor. If there are multiple tokens sharing a position, then incrementToken() will iterate through each one, rebuilding the graph each time.
Adrien Grand (@jpountz) (migrated from JIRA)
Exciting! I'll need some time to digest this new implementation but I already have some questions/remarks :)
Alan Woodward (@romseygeek) (migrated from JIRA)
but won't generate a graph unless the input is a graph too
Correct
Should we just replace it?
I'll need to rejig it slightly so that it produces identical output, but I think that's worth doing. And a drop-in replacement would be easier for upgrades as well.
I'll add some more tests, mixing up CannedTokenStream and real analyzer outputs.
Alan Woodward (@romseygeek) (migrated from JIRA)
Updated patch:
Some of this behaviour is different to the standard ShingleFilter, so I think we should probably keep them separate (while deprecating the old version and removing in master)
David Smiley (@dsmiley) (migrated from JIRA)
Some of this behaviour is different to the standard ShingleFilter, so I think we should probably keep them separate (while deprecating the old version and removing in master)
Wouldn't this be a Version switch situation instead? Why bother the users with a new name instead?
Hi. Got bitten by this today after a lemmatizer filter produced two variants of base word at the same position and ShingleFilter producing a "shingle" from these variants, failing to produce a variant of first shingle:
"<term1> <term2>" -> "<term1> <term2-1|term2-2>" -> ["<term1> <term2-1>", "<term2-1> <term2-2>"]
, where the correct result would be ["<term1> <term2-1>", "<term1> <term2-2>"]
.
ShingleFilter is creating shingles for a single term that has been expanded by synonyms when it shouldn't. The position increment is 0.
As an example, I have an Analyzer with a SynonymFilter followed by a ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces two tokens and position 1: car, auto. The ShingleFilter is then producing 3 tokens, when there should only be two: car, car auto, auto. This behavior seems incorrect.
Migrated from LUCENE-3475 by Cameron, 4 votes, updated Mar 13 2018 Attachments: LUCENE-3475.patch (versions: 2) Linked issues:
5242