NGramFilter -- construct n-grams from a TokenStream [LUCENE-400]

asfimport commented 19 years ago

This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.

The filter sets start offsets, end offsets and position increments, so highlighting and phrase queries should work.

Position increments > 1 in the input stream are replaced by filler tokens (tokens with termText "_" and endOffset - startOffset = 0) in the output n-grams. (Position increments > 1 in the input stream are usually caused by removing some tokens, e.g. stopwords, from a stream.)
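To make the filler-token behaviour concrete, here is a minimal sketch in plain Java that does not use the Lucene API or the attached NGramFilter. The `Tok` class, the method names, and the choice to emit only n-grams of size 2 up to `maxSize` are assumptions made for illustration. It also assumes that a position increment of k inserts k - 1 filler tokens, which may differ from what the attached filter does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FillerShingleSketch {
    // Filler term described in the issue text.
    static final String FILLER = "_";

    /** Hypothetical minimal token: term text plus position increment. */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /** Assumption: a position increment of k inserts k - 1 fillers before the token. */
    static List<String> expandGaps(List<Tok> input) {
        List<String> out = new ArrayList<>();
        for (Tok t : input) {
            for (int i = 1; i < t.posInc; i++) {
                out.add(FILLER);
            }
            out.add(t.text);
        }
        return out;
    }

    /** Emits every n-gram of size 2..maxSize over the gap-expanded terms. */
    static List<String> shingles(List<Tok> input, int maxSize) {
        List<String> terms = expandGaps(input);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < terms.size(); start++) {
            StringBuilder sb = new StringBuilder(terms.get(start));
            for (int size = 2; size <= maxSize && start + size <= terms.size(); size++) {
                sb.append(' ').append(terms.get(start + size - 1));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "wizard of oz" after a stop filter removed "of": "oz" carries increment 2.
        List<Tok> input = Arrays.asList(new Tok("wizard", 1), new Tok("oz", 2));
        System.out.println(shingles(input, 2)); // prints [wizard _, _ oz]
    }
}
```

The sketch ignores offsets entirely; it only shows where filler terms would land in the output n-grams.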

The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache Commons-Collections.

Filter, test case and an analyzer are attached.

Migrated from LUCENE-400 by Sebastian Kirsch, 5 votes, resolved Mar 29 2008

Environment:

Operating System: All
Platform: All

Attachments: ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java, ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java, ASF.LICENSE.NOT.GRANTED--NGramFilter.java, ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java, LUCENE-400.patch

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15504) NGramFilter

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15505) NGramAnalyzerWrapper (wraps an NGramFilter around an analyzer.)

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15506) JUnit TestCase for NGramFilter

asfimport commented 19 years ago

Robert Newson (migrated from JIRA)

* <p>For example, the sentence "please divide this sentence into ngrams" would be
tokenized into the tokens "please divide", "this sentence", "sentence into", and
"into ngrams".

The comment should read:

<p>For example, the sentence "please divide this sentence into ngrams" would be
tokenized into the tokens "please divide", "divide this", "this sentence", "sentence into", and
"into ngrams".
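The corrected list can be checked with a small standalone sketch. This is plain Java rather than the attached filter; the class and method names are hypothetical, and it only pairs adjacent whitespace-separated tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BigramExample {
    /** Joins every pair of adjacent tokens with a single space. */
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("please divide this sentence into ngrams".split(" "));
        // prints [please divide, divide this, this sentence, sentence into, into ngrams]
        System.out.println(bigrams(tokens));
    }
}
```

Six input tokens yield five bigrams, so the original comment was indeed missing "divide this".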

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15818) JUnit test class for NGramAnalyzerWrapper

The tests in this class are concerned with the interaction between QueryParser and an NGramAnalyzer, and whether searching works as expected on an index constructed with an NGramAnalyzer.

One of the test cases throws an exception that I haven't investigated yet. So proceed with caution if you use the QueryParser with NGramAnalyzer.

..E....
Time: 1.771
There was 1 error:
1) testNGramAnalyzerWrapperPhraseQueryParsingFails(org.apache.lucene.analysis.NGramAnalyzerWrapperTest)java.lang.NullPointerException
    at org.apache.lucene.index.MultipleTermPositions.skipTo(MultipleTermPositions.java:178)
    at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
    at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:73)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:66)
    at org.apache.lucene.search.Scorer.score(Scorer.java:47)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:102)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
    at org.apache.lucene.search.Hits.<init>(Hits.java:44)
    at org.apache.lucene.search.Searcher.search(Searcher.java:40)
    at org.apache.lucene.search.Searcher.search(Searcher.java:32)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.queryParsingTest(NGramAnalyzerWrapperTest.java:75)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.testNGramAnalyzerWrapperPhraseQueryParsingFails(NGramAnalyzerWrapperTest.java:100)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.main(NGramAnalyzerWrapperTest.java:36)

FAILURES!!!
Tests run: 6, Failures: 0, Errors: 1

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Sebastian, ever figured out the problem? Also, is there a way to get rid of the Commons Collections? Lucene has no run-time dependencies on other libraries.

asfimport commented 18 years ago

Sebastian Kirsch (migrated from JIRA)

Hi Otis,

I did not figure out the problem. Getting rid of Commons Collections should be no problem; I am just using the buffers as FIFOs. However, I do not have the time at the moment to implement this.

Kind regards, Sebastian
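
For anyone picking this up, a FIFO without Commons Collections could look roughly like the sketch below. It assumes a JDK that ships `java.util.ArrayDeque` (Java 6 or later), which was not necessarily available to Lucene at the time. `BoundedFifo` and its methods are hypothetical names, not taken from the attached patch. It mimics the evict-oldest-when-full behaviour of CircularFifoBuffer; a plain `ArrayDeque` or `LinkedList` would already cover the UnboundedFifoBuffer case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fixed-capacity FIFO that drops its oldest element when full. */
public class BoundedFifo<T> {
    private final ArrayDeque<T> items;
    private final int capacity;

    public BoundedFifo(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
        this.items = new ArrayDeque<>(capacity);
    }

    /** Appends an item, evicting the oldest one first if the buffer is full. */
    public void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst();
        }
        items.addLast(item);
    }

    public boolean isFull() {
        return items.size() == capacity;
    }

    /** Returns the current contents, oldest first, as an independent copy. */
    public List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

With capacity n, the buffer always holds the last n tokens seen, which is the window an n-gram filter needs to emit its next combination.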