apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.02k forks source link

ShingleFilter should handle positionIncrement of zero, e.g. synonyms [LUCENE-3475] #4549

Open asfimport opened 13 years ago

asfimport commented 13 years ago

ShingleFilter is creating shingles for a single term that has been expanded by synonyms when it shouldn't. The position increment is 0.

As an example, I have an Analyzer with a SynonymFilter followed by a ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces two tokens and position 1: car, auto. The ShingleFilter is then producing 3 tokens, when there should only be two: car, car auto, auto. This behavior seems incorrect.


Migrated from LUCENE-3475 by Cameron, 4 votes, updated Mar 13 2018 Attachments: LUCENE-3475.patch (versions: 2) Linked issues:

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

From the ShingleFilter javadocs: "This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0."

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

See ShingleMatrixFilter for a shingle implementation that appears to support synonyms. But it's deprecated and slated for removal in Lucene 4.0. So caveat emptor.

asfimport commented 13 years ago

Cameron (migrated from JIRA)

"It does not handle a position increment of 0."

...I had interpreted that to mean it would just 'ignore' or not try to do anything with tokens where posIncr=0. I guess that is an incorrect assumption?

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

That sentence was intended to warn users that ShingleFilter behavior was undefined over streams with position increment of zero.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is important, now that we have "graph" analyzers (like Kuromoji).

So ShingleFilter should pay attention to posInc as well as posLength...

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

5242 includes an example of what it would look like if ShingleFilter properly handled its input as a graph, i.e. respected posincr=0 and poslen>1.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm not going to work on this in the short term but I just wrote a test case in case someone else has interest in fixing this issue:

  public void testSynonyms1() throws IOException {
    Token t1 = new Token("small", 0, 5);
    Token t2 = new Token("little", 0, 5);
    t2.setPositionIncrement(0);
    Token t3 = new Token("thing", 6, 11);
    TokenStream tk = new CannedTokenStream(t1, t2, t3);
    ShingleFilter f = new ShingleFilter(tk, 2, 2);
    f.setOutputUnigrams(false);
    assertTokenStreamContents(f,
        new String[] { "small thing", "little thing" },
        new int[] {0, 0},
        new int[] {11, 11},
        new int[] {1, 0});
  }

  public void testSynonyms2() throws IOException {
    Token t1 = new Token("small", 0, 5);
    Token t2 = new Token("thing", 6, 11);
    Token t3 = new Token("object", 6, 11);
    t3.setPositionIncrement(0);
    TokenStream tk = new CannedTokenStream(t1, t2, t3);
    ShingleFilter f = new ShingleFilter(tk, 2, 2);
    f.setOutputUnigrams(false);
    assertTokenStreamContents(f,
        new String[] { "small thing", "small object" },
        new int[] {0, 0},
        new int[] {11, 11},
        new int[] {1, 0});
  }
asfimport commented 6 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jpountz Hi Adrien!  I wonder what would be the best approach to handle positionIncrement=0?

I was thinking that in  ShingleFilter:getNextToken we can do something like this:

if (input.incrementToken()) {
   while (posIncrAtt.getPositionIncrement()  == 0) { // we may have multiple synonyms
      if (input.incrementToken()) { // go to next token 
           ....
          // store synonyms tokens and following tokens somewhere and create a new input TokenStream from them? 

I guess I am wondering if we have any other reference code that recreates a tokenStream from synonym tokens?  

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@mayya-sharipova I think we need to do something like this indeed. We might also need `getNextTokenI don't think you should try to recreate a token stream from the synonyms, however it is true that you need to record their state somewhere. There are multiple ways to do this, for instance ShingleFilter uses \cloneAttributes, but there is alsocaptureState}}/\restoreState. You can look at \EdgeNGramTokenFilter` for an example of how to use them.

 

I haven't looked deeply at how the token filter currently works, but I wouldn't be surprised that significant refactorings are required to make things work, so feel free to reorganize the code if it makes this problem any easier to fix!

asfimport commented 6 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jpountz thank so much for suggestions.

@romseygeek is going to work on this issue. I will study his solution.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's a patch with a try at implementing a ShingleGraphFilter.  Still requires javadocs, more testing, randomized testing, adding into RandomChains, etc, but it would be good to get some more eyes on it.

It has a slightly different output to ShingleFilter on non-graph tokenstreams, in that it emits shingles longest first, due to the way I went about it.  It might not be too difficult to change that though.

To support backtracking in the TokenStream, the underlying stream is read (lazily) into a linked list, with Token objects reused once the filter has moved past them.  The current shingle is just an array of references into the list.  Shingles are built using the nextTokenInGraph(Token) method, which will use a token's length attribute to move through the linked list and find its successor.  If there are multiple tokens sharing a position, then incrementToken() will iterate through each one, rebuilding the graph each time.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Exciting! I'll need some time to digest this new implementation but I already have some questions/remarks :)

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

but won't generate a graph unless the input is a graph too

Correct

Should we just replace it?

I'll need to rejig it slightly so that it produces identical output, but I think that's worth doing.  And a drop-in replacement would be easier for upgrades as well.

I'll add some more tests, mixing up CannedTokenStream and real analyzer outputs.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Updated patch:

Some of this behaviour is different to the standard ShingleFilter, so I think we should probably keep them separate (while deprecating the old version and removing in master)

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Some of this behaviour is different to the standard ShingleFilter, so I think we should probably keep them separate (while deprecating the old version and removing in master)

Wouldn't this be a Version switch situation instead? Why bother the users with a new name instead?

onyxmaster commented 1 year ago

Hi. Got bitten by this today after a lemmatizer filter produced two variants of base word at the same position and ShingleFilter producing a "shingle" from these variants, failing to produce a variant of first shingle: "<term1> <term2>" -> "<term1> <term2-1|term2-2>" -> ["<term1> <term2-1>", "<term2-1> <term2-2>"], where the correct result would be ["<term1> <term2-1>", "<term1> <term2-2>"].