apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

NGramFilter -- construct n-grams from a TokenStream [LUCENE-400] #1478

Closed asfimport closed 16 years ago

asfimport commented 19 years ago

This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.

The filter sets start offsets, end offsets and position increments, so highlighting and phrase queries should work.

Position increments > 1 in the input stream are replaced by filler tokens (tokens with termText "_" and endOffset - startOffset = 0) in the output n-grams. (Position increments > 1 in the input stream are usually caused by removing some tokens, e.g. stopwords, from the stream.)
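To make the behaviour described above concrete, here is a minimal, self-contained sketch of word-level shingling with filler tokens. It is not the attached NGramFilter and does not use the Lucene API: the `ShingleSketch` class, the `Tok` type, and the `fillGaps` and `shingles` methods are hypothetical names invented for illustration. Placing each filler at the start offset of the following token, emitting the filler as a unigram, and giving multi-word shingles a position increment of 0 are all assumptions, not details confirmed by the attachments.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical, not Lucene classes. */
final class ShingleSketch {

    /** Minimal stand-in for a token: text, character offsets, position increment. */
    static final class Tok {
        final String text;
        final int start;   // start offset in the original text
        final int end;     // end offset in the original text
        final int posInc;  // position increment relative to the previous token

        Tok(String text, int start, int end, int posInc) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + "]";
        }
    }

    /** Inserts one zero-length "_" filler for every position skipped in the input. */
    static List<Tok> fillGaps(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (int gap = 1; gap < t.posInc; gap++) {
                // Assumption: the filler sits at the start offset of the next real token.
                out.add(new Tok("_", t.start, t.start, 1));
            }
            out.add(new Tok(t.text, t.start, t.end, 1));
        }
        return out;
    }

    /** Emits every run of 1..maxSize consecutive tokens, joined by a single space. */
    static List<Tok> shingles(List<Tok> in, int maxSize) {
        List<Tok> filled = fillGaps(in);
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < filled.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 1; n <= maxSize && i + n <= filled.size(); n++) {
                Tok last = filled.get(i + n - 1);
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(last.text);
                // Assumption: unigrams advance the position, longer shingles stack on it.
                out.add(new Tok(sb.toString(), filled.get(i).start, last.end, n == 1 ? 1 : 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "please divide this sentence" with the stopword "this" removed upstream,
        // so "sentence" arrives with a position increment of 2.
        List<Tok> in = List.of(
            new Tok("please", 0, 6, 1),
            new Tok("divide", 7, 13, 1),
            new Tok("sentence", 19, 27, 2));
        System.out.println(shingles(in, 2));
    }
}
```

With a maximum shingle size of 2, the removed stopword shows up as "divide _" and "_ sentence" rather than as a misleading "divide sentence" shingle, and each shingle spans from the start offset of its first token to the end offset of its last.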

The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache Commons-Collections.

Filter, test case and an analyzer are attached.


Migrated from LUCENE-400 by Sebastian Kirsch, 5 votes, resolved Mar 29 2008

Environment:

Operating System: All
Platform: All

Attachments: ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java, ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java, ASF.LICENSE.NOT.GRANTED--NGramFilter.java, ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java, LUCENE-400.patch

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15504) NGramFilter

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15505) NGramAnalyzerWrapper (wraps an NGramFilter around an analyzer.)

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15506) JUnit TestCase for NGramFilter

asfimport commented 19 years ago

Robert Newson (migrated from JIRA)

* <p>For example, the sentence "please divide this sentence into ngrams" would be

The comment should read;

asfimport commented 19 years ago

Sebastian Kirsch (migrated from JIRA)

Created an attachment (id=15818) JUnit test class for NGramAnalyzerWrapper

The tests in this class are concerned with the interaction between QueryParser and an NGramAnalyzer, and whether searching works as expected on an index constructed with an NGramAnalyzer.

One of the test cases throws an exception that I haven't investigated yet. So proceed with caution if you use the QueryParser with NGramAnalyzer.

..E....
Time: 1.771
There was 1 error:
1) testNGramAnalyzerWrapperPhraseQueryParsingFails(org.apache.lucene.analysis.NGramAnalyzerWrapperTest)java.lang.NullPointerException
    at org.apache.lucene.index.MultipleTermPositions.skipTo(MultipleTermPositions.java:178)
    at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
    at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:73)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:66)
    at org.apache.lucene.search.Scorer.score(Scorer.java:47)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:102)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
    at org.apache.lucene.search.Hits.<init>(Hits.java:44)
    at org.apache.lucene.search.Searcher.search(Searcher.java:40)
    at org.apache.lucene.search.Searcher.search(Searcher.java:32)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.queryParsingTest(NGramAnalyzerWrapperTest.java:75)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.testNGramAnalyzerWrapperPhraseQueryParsingFails(NGramAnalyzerWrapperTest.java:100)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.main(NGramAnalyzerWrapperTest.java:36)

FAILURES!!!
Tests run: 6, Failures: 0, Errors: 1

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Sebastian, ever figured out the problem? Also, is there a way to get rid of the Commons Collections? Lucene has no run-time dependencies on other libraries.

asfimport commented 18 years ago

Sebastian Kirsch (migrated from JIRA)

Hi Otis,

I did not figure out the problem. Getting rid of Commons Collections should be no problem; I am just using them as FIFOs. However, I do not have the time at the moment to implement this.

Kind regards, Sebastian
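Since the two Commons Collections classes are only used as FIFOs, a dependency-free replacement could look roughly like the sketch below. It assumes a JDK that ships `java.util.ArrayDeque` (Java 6 or later), which may not match the Java version Lucene targeted at the time. `BoundedFifo` is a hypothetical name, not a class from the attachments or the committed patch: it drops its oldest element when full, which is the behaviour documented for `CircularFifoBuffer`. An unbounded FIFO needs no wrapper at all, since a plain `ArrayDeque` already provides it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: a fixed-capacity FIFO that drops its oldest element when full. */
final class BoundedFifo<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedFifo(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /** Appends an element, evicting the oldest one first if the buffer is full. */
    void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst();
        }
        items.addLast(item);
    }

    /** Removes and returns the oldest element, or null if the buffer is empty. */
    T poll() {
        return items.pollFirst();
    }

    int size() {
        return items.size();
    }

    /** Snapshot of the current contents, oldest first. */
    List<T> toList() {
        return new ArrayList<>(items);
    }

    public static void main(String[] args) {
        // A window of the three most recent words, as a shingle filter might keep.
        BoundedFifo<String> window = new BoundedFifo<>(3);
        for (String w : "please divide this sentence".split(" ")) {
            window.add(w);
        }
        System.out.println(window.toList());
    }
}
```

A sliding window of the last few tokens is all a shingle filter needs from its buffer, so a small wrapper like this would keep the filter free of run-time dependencies.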

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Lucene has NGram support

asfimport commented 16 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Lucene has character NGram support, but not word NGram support, which this filter supplies:

This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.
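To illustrate the distinction, the sketch below computes both kinds of n-gram over plain strings. It does not use Lucene's tokenizers or filters: `NGramKinds`, `charNGrams`, and `wordNGrams` are hypothetical names, and splitting on whitespace is a simplifying assumption standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; these helpers are hypothetical and not part of Lucene. */
final class NGramKinds {

    /** Character n-grams of exactly size n taken from within a single term. */
    static List<String> charNGrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    /** Word n-grams (shingles) of exactly size n across whitespace-separated words. */
    static List<String> wordNGrams(String text, int n) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("please", 2));
        System.out.println(wordNGrams("please divide this sentence", 2));
    }
}
```

Character n-grams slice inside one term ("pl", "le", "ea", ...), while word n-grams combine neighbouring terms ("please divide", "divide this", ...), which is what this issue's filter provides.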

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Good catch, Steve. I will reopen, as a word based ngram filter is useful.

asfimport commented 16 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Repackaged these four files as a patch, with the following modifications to the code:

All tests pass.

Although I left in the ShingleAnalyzerWrapper and its test in the patch, no other Lucene filter (AFAICT) has such a filter wrapping facility. My vote is to remove these two files.

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Thanks, Steve. I will mark this as 2.4

asfimport commented 16 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Removed the duplicate link (to LUCENE-759), since that issue is about character-level n-grams, and this issue is about word-level n-grams.

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Thanks for bringing this up to date. I'll commit it after 2.3 is out.

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

ping, Otis, do you still plan to commit?

asfimport commented 16 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

re-ping, Otis, do you still plan to commit?

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Sorry for hogging. Got some local compilation issues with the query builder in contrib, so assigning to Grant to get this in.

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Committed revision 642612.

Thanks Sebastian and Steve