Closed asfimport closed 16 years ago
Sebastian Kirsch (migrated from JIRA)
Created an attachment (id=15504) NGramFilter
Sebastian Kirsch (migrated from JIRA)
Created an attachment (id=15505) NGramAnalyzerWrapper (wraps an NGramFilter around an analyzer.)
Sebastian Kirsch (migrated from JIRA)
Created an attachment (id=15506) JUnit TestCase for NGramFilter
Robert Newson (migrated from JIRA)
* <p>For example, the sentence "please divide this sentence into ngrams" would be
The comment should read;
Sebastian Kirsch (migrated from JIRA)
Created an attachment (id=15818) JUnit test class for NGramAnalyzerWrapper
The tests in this class are concerned with the interaction between QueryParser and an NGramAnalyzer, and whether searching works as expected on an index constructed with an NGramAnalyzer.
One of the test cases (testNGramAnalyzerWrapperPhraseQueryParsingFails) throws a NullPointerException that I haven't investigated yet, so proceed with caution if you use the QueryParser with NGramAnalyzer.
..E....
Time: 1.771
There was 1 error:
1) testNGramAnalyzerWrapperPhraseQueryParsingFails(org.apache.lucene.analysis.NGramAnalyzerWrapperTest)java.lang.NullPointerException
    at org.apache.lucene.index.MultipleTermPositions.skipTo(MultipleTermPositions.java:178)
    at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
    at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:73)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:66)
    at org.apache.lucene.search.Scorer.score(Scorer.java:47)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:102)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
    at org.apache.lucene.search.Hits.<init>(Hits.java:44)
    at org.apache.lucene.search.Searcher.search(Searcher.java:40)
    at org.apache.lucene.search.Searcher.search(Searcher.java:32)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.queryParsingTest(NGramAnalyzerWrapperTest.java:75)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.testNGramAnalyzerWrapperPhraseQueryParsingFails(NGramAnalyzerWrapperTest.java:100)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.main(NGramAnalyzerWrapperTest.java:36)
FAILURES!!!
Tests run: 6, Failures: 0, Errors: 1
Otis Gospodnetic (@otisg) (migrated from JIRA)
Sebastian, did you ever figure out the problem? Also, is there a way to get rid of the Commons Collections dependency? Lucene has no run-time dependencies on other libraries.
Sebastian Kirsch (migrated from JIRA)
Hi Otis,
I did not figure out the problem. Getting rid of Commons Collections should be no problem; I am just using them as FIFOs. However, I do not have the time at the moment to implement this.
Kind regards, Sebastian
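For anyone picking this up: if the Commons Collections buffers really are used only as FIFOs, a fixed-size one can be approximated with java.util.ArrayDeque. The following is a minimal sketch under that assumption. BoundedFifo and its methods are hypothetical names, not taken from the attached code. It mimics the drop-oldest-when-full behaviour of CircularFifoBuffer; an unbounded FIFO would just be a plain ArrayDeque.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical replacement for a fixed-size FIFO buffer, using only java.util.
class BoundedFifo<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedFifo(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    // Append an element; if the buffer is already full, drop the oldest one first.
    void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst();
        }
        items.addLast(item);
    }

    // Copy of the current contents, oldest element first.
    List<T> snapshot() {
        return new ArrayList<>(items);
    }

    int size() {
        return items.size();
    }
}
```

With a capacity of 3, adding "a", "b", "c", "d" in that order leaves "b", "c", "d" in the buffer.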
Grant Ingersoll (@gsingers) (migrated from JIRA)
Lucene has NGram support
Steven Rowe (@sarowe) (migrated from JIRA)
Lucene has character NGram support, but not word NGram support, which this filter supplies:
This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.
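To make the distinction concrete, here is a small standalone sketch in plain Java with no Lucene classes. NGramKinds, charNGrams and wordNGrams are hypothetical names used only for illustration, and the sketch emits n-grams of a single size rather than every size up to a maximum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramKinds {
    // Character n-grams: every substring of length n inside a single term.
    static List<String> charNGrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    // Word n-grams (shingles): every run of n adjacent tokens, joined by a space.
    static List<String> wordNGrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [pl, le, ea, as, se]
        System.out.println(charNGrams("please", 2));
        // prints [please divide, divide this, this sentence, sentence into, into ngrams]
        System.out.println(wordNGrams(
            Arrays.asList("please", "divide", "this", "sentence", "into", "ngrams"), 2));
    }
}
```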
Grant Ingersoll (@gsingers) (migrated from JIRA)
Good catch, Steve. I will reopen, as a word based ngram filter is useful.
Steven Rowe (@sarowe) (migrated from JIRA)
Repackaged these four files as a patch, with the following modifications to the code:
@author from javadocs
All tests pass.
Although I left in the ShingleAnalyzerWrapper and its test in the patch, no other Lucene filter (AFAICT) has such a filter wrapping facility. My vote is to remove these two files.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Thanks, Steve. I will mark this as 2.4
Steven Rowe (@sarowe) (migrated from JIRA)
Removed the duplicate link (to LUCENE-759), since that issue is about character-level n-grams, and this issue is about word-level n-grams.
Otis Gospodnetic (@otisg) (migrated from JIRA)
Thanks for bringing this up to date. I'll commit it after 2.3 is out.
Grant Ingersoll (@gsingers) (migrated from JIRA)
ping, Otis, do you still plan to commit?
Steven Rowe (@sarowe) (migrated from JIRA)
re-ping, Otis, do you still plan to commit?
Otis Gospodnetic (@otisg) (migrated from JIRA)
Sorry for hogging. Got some local compilation issues with the query builder in contrib, so assigning to Grant to get this in.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Committed revision 642612.
Thanks Sebastian and Steve
This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.
The filter sets start offsets, end offsets and position increments, so highlighting and phrase queries should work.
Position increments > 1 in the input stream are replaced by filler tokens (tokens with termText "_" and endOffset - startOffset = 0) in the output n-grams. (Position increments > 1 in the input stream are usually caused by removing some tokens, e.g. stopwords, from a stream.)
The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache Commons-Collections.
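The behaviour described above can be sketched roughly as follows. This is not the attached NGramFilter: Tok is a hypothetical stand-in for a Lucene token, the whole input is held in a list instead of being streamed through FIFO buffers, and output position increments are not modelled. Placing each filler at the start offset of the next real token, and emitting shingles that begin or end with a filler, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {
    // Hypothetical stand-in for a token: text, character offsets, position increment.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posInc;

        Tok(String text, int start, int end, int posInc) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    // Replace each position gap (posInc > 1) with zero-width "_" filler tokens.
    static List<Tok> fillGaps(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (int gap = 1; gap < t.posInc; gap++) {
                // Assumption: the filler sits at the start offset of the next real token.
                out.add(new Tok("_", t.start, t.start, 1));
            }
            out.add(new Tok(t.text, t.start, t.end, 1));
        }
        return out;
    }

    // Emit every run of 2..maxSize adjacent tokens.
    // Each shingle spans from the first token's start offset to the last token's end offset.
    static List<Tok> shingles(List<Tok> in, int maxSize) {
        List<Tok> filled = fillGaps(in);
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < filled.size(); i++) {
            Tok first = filled.get(i);
            StringBuilder text = new StringBuilder(first.text);
            for (int n = 2; n <= maxSize && i + n <= filled.size(); n++) {
                Tok last = filled.get(i + n - 1);
                text.append(' ').append(last.text);
                // posInc is a placeholder here; the sketch does not model it.
                out.add(new Tok(text.toString(), first.start, last.end, 1));
            }
        }
        return out;
    }
}
```

For the text "divide this sentence" with the stopword "this" removed, the input is "divide" (offsets 0 to 6, increment 1) and "sentence" (offsets 12 to 20, increment 2). With a maximum size of 3, the sketch yields "divide _", "divide _ sentence" and "_ sentence", the middle one spanning offsets 0 to 20.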
Filter, test case and an analyzer are attached.
Migrated from LUCENE-400 by Sebastian Kirsch, 5 votes, resolved Mar 29 2008
Attachments: ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java, ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java, ASF.LICENSE.NOT.GRANTED--NGramFilter.java, ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java, LUCENE-400.patch