Add ShingleFilter option to output unigrams if no shingles can be generated [LUCENE-1370]

asfimport commented 16 years ago

Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.

My use case here is speeding up phrase queries. The technique is as follows:

First, doing index-time analysis using ShingleFilter (using outputUnigrams==true), thereby expanding things as follows:

"please divide this sentence into shingles" -> "please", "please divide" "divide", "divide this" "this", "this sentence" "sentence", "sentence into" "into", "into shingles" "shingles"

Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:

"please divide this sentence into shingles" -> "please divide" "divide this" "this sentence" "sentence into" "into shingles"

By doing phrase queries with bigrams like this, I can gain a very considerable speedup. Without the outputUnigramIfNoNgrams option, then a single word query would tokenize like this:

"please" -> [no tokens]

But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:

"please" -> "please"

****

The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.

****

I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

Migrated from LUCENE-1370 by Chris Harris, 3 votes, resolved Oct 09 2010 Attachments: LUCENE-1370.patch (versions: 7), ShingleFilter.patch Linked issues:

SOLR-744

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

It's an OK filter setting if you ask me.

However I'm curious to why you don't query for unigrams unless the input is a single token? That means you always require a 0-slop between any two tokens of the input. I know nothing about your needs, but it could be dangerous. You can always boost the bigrams a bit more than the unigrams if they cause a problem. I think you should benchmark the cost. I'm sure it's rather small and that you'll get better quality results by doing that. Users tend to never enter a query the way I want them to.

asfimport commented 16 years ago

Chris Harris (migrated from JIRA)

Karl, that's a good point about my setup being incompatible with non-0 slop. However, the performance gains I'm seeing with this patch on my data are substantial. When I last tested on the same index, same machine, same # of threads in my testing process, etc., and went from analyzing my queries with

outputUnigrams==true

to analyzing with

using outputUnigrams==false and outputUnigramIfNoNgrams==true

phrase queries ended up performing something like 50x as fast. Which is good, because the initial performance wasn't acceptable.

The performance gains from outputUnigramIfNoNgrams were greater than those from when I instead tried moving the index to a solid state drive. (It was a a fairly entry-level SSD drive, but still.) It would be interesting to compare to moving to a machine with an obscene amount of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of that is taken up by stored fields.)

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

Do you say it is 50x faster with shingle queries that only contains bigram compared to shingle queries that contains uni- and bigrams? Or is it 50x faster using shingles compared to phrase queries? (I've myself seen performance gains similar to the latter.)

asfimport commented 16 years ago

Chris Harris (migrated from JIRA)

: Do you say it is 50x faster with shingle queries that only contains bigram : compared to shingle queries that contains uni- and bigrams? Or is it 50x : faster using shingles compared to phrase queries? (I've myself seen : performance gains similar to the latter.)

I'm not sure I totally understand you, so let me try rephrasing things. What I mean is that phrase queries that have only bigrams, like this:

PhraseQuery { "please divide" "divide this" "this sentence" "sentence into" "into shingles" }

run maybe 50x as fast as phrase queries that have both bigrams and unigrams, like this:

PhraseQuery { "please", "please divide" "divide", "divide this" "this", "this sentence" "sentence", "sentence into" "into", "into shingles" "shingles" }

If it clarifies things any further, let me say that I'm handling all quoted phrase queries with the normal PhraseQuery class; I'm not, for instance, turning quoted phrases into some kind of BooleanQuery. (Technically it's not me that's making the PhraseQuery object but Solr and its query parser. But Solr is indeed turning my quoted phrase queries into normal PhraseQuery objects.)

asfimport commented 16 years ago

Chris Harris (migrated from JIRA)

Fixing to merge cleanly against changes made in r687357. The patch file will also now have a proper name, LUCENE-1370.patch.

asfimport commented 16 years ago

Chris Harris (migrated from JIRA)

Getting rid of Windows-style newlines

asfimport commented 15 years ago

Chris Harris (migrated from JIRA)

Updating the patch to apply successfully against r813015, and hopefully thereby against the upcoming Lucene 2.9 release.

asfimport commented 15 years ago

Chris Harris (migrated from JIRA)

Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some random thoughts:

The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite part is how fillShingleBuffer() is now responsible for setting firstToken; that it should do so is not obvious from the method name.)
This patch's functionality could potentially be implemented without modifying ShingleFilter itself. For example, maybe instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper class, that would delegate to ShingleFilter, except when ShingleFilter failed to produce ngrams. I'm a little worried that this would require using a CachingTokenFilter, and that might not be ideal from an efficiency perspective.
It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams. With the current naming scheme, you end up setting some contradictory-sounding options, e.g. setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this might not be confusing, but it'd be nice if it were more straightforward without making you look at the code.
I gather that some people have an interest in making a minShingleSize option. (See http://www.nabble.com/shingle-filter-td25125367.html) I'm not sure how best to modify this patch should that get implemented. It might depend on the typical use cases for minShingleSize, and if there's any overlap with my use case here.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Chris, just a small comment:

  if (outputUnigramIfNoNgrams && firstToken == null) {
          firstToken = captureState();
  }
  shingleBuf.add(captureState());

here i think you could save a clone() by not calling captureState twice? even though it doesnt have to recompute the state, captureState does have to clone it.

(I am thinking about trying out your patch where i have some very short fields, this is why i mentioned it)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

even though it doesnt have to recompute the state, captureState does have to clone it.

The recomputation is only done once for the lifetime of the TokenStream (as far as there are no modifications in the number of attributes). The State linked list is just another representation of the instances LinkedHashMap for faster iteration during cloning (no new Iterators to create and so on).

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, I see... i had not looked far enough yet to see how this was done, thanks.

but i think my comment still holds true, we can save a clone() here on the first token. for long fields does not matter so much, but for short fields a small savings.

asfimport commented 15 years ago

Chris Harris (migrated from JIRA)

here i think you could save a clone() by not calling captureState twice? even though it doesnt have to recompute the state, captureState does have to clone it.

So is the idea to replace

      if (getNextToken()) {
        if (outputUnigramIfNoNgrams && firstToken == null) {
          firstToken = captureState();
        }
        shingleBuf.add(captureState());

with

      if (getNextToken()) {
        State curState = captureState();
        if (outputUnigramIfNoNgrams && firstToken == null) {
          firstToken = curState;
        }
        shingleBuf.add(curState);

That seems fine, unless there's some hidden reason why you can't share State objects.

I'd guess you could optimize more than that, but I think you run into diminishing returns, making the code harder to read more than you're making it faster. For example:

     if (getNextToken()) {
        if (outputUnigramIfNoNgrams && firstToken == null) {
          firstToken = captureState();
          shingleBuf.add(firstToken);
        }
        else {
          shingleBuf.add(captureState());
        }

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Chris, yeah I was thinking something like your first method... i'm not concerned with any micro-optimization just reducing a clone if we can )

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

AttributeSource.State objects are unmodifiable (as the contents are private and so you have no access to the wrapped attributes).

asfimport commented 15 years ago

Chris Harris (migrated from JIRA)

Update patch to avoid an unnecessary State.clone(), as suggested by Robert.

asfimport commented 15 years ago

Chris Harris (migrated from JIRA)

At the risk of being annoying, is there any chance this patch (perhaps slightly refined, if you guys want) could make it into Lucene 3.0? I think I'm not the only person who wants to use the ShingleFilter in this slightly way. For example, I just noticed that the new Solr 1.4 Enterprise Search Server book (on p. 288) makes brief mention of SOLR-744, which depends on this patch. I'm fine keeping this out of the Lucene distribution if people think ShingleFilter.outputUnigrams is a silly hack and there's a better way to get the job done. But if this captures a legitimate use case, are there compelling reasons to keep it out?

Thanks.

asfimport commented 15 years ago

Karl Wettin (migrated from JIRA)

Oups, I seem to have assigned this to me and then forgotten about it. Sorry! I'll check it out this weekend!

asfimport commented 15 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Karl, are you working on this? Seems like this should not block 3.0 - feel free to move it to 3.1

simon

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This patch is very straightforward, I think we could add it to 3.0, too. But 3.0 should not conatin new features, so move to 3.1?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I move this to 3.1 as it is a new feature.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think this patch is really important to get into 3.x and trunk...

its a dead simple way (analysis configuration only) for folks to get a massive performance improvement for phrasequeries.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

The previous patch predates the rewrite ShingleFilter was subjected to as a result of #3294, so it needed to be rejiggered somewhat.

Changes from the previous patch:

The new patch simply enables unigram output if the number of input tokens is less than minShingleSize. The existing code then handles the situation appropriately, and reset() restores the original unigram output option.
I renamed the option "outputUnigramIfNoNgrams" to "outputUnigramsIfNoShingles", because: ** Unigram -> Unigrams: the output could result in more than one unigram if minShingleSize is greater than the default 2; and ** Ngrams -> Shingles: for consistency with the class's name.
I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its (boolean) sense, because: ** unigrams should be output only if no shingles can be output, rather than no tokens; and ** reversing the sense allowed the test using it to avoid negation, and allowed the name to be shorter.
I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the effect that if outputUnigram == true, unigrams will always be output regardless of the setting of outputUnigramsIfNoShingles.
I added a test that makes sure that when minShingleSize > 2 and the number of input tokens is less than minShingleSize, (multiple) unigrams are output

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Added lucene/CHANGES.txt entry, and slightly modified reset() code switching back the unigram output option.

All tests pass.

If there are no objections, I'll commit this in a couple of days.

edit: I added an entry to *modules/analysis/*CHANGES.txt, not *lucene/*CHANGES.txt

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Looks good to me, I tested it and had no problems.

Maybe we should add the option to ShingleAnalyzerWrapper too? This would be very convenient for Lucene users.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Looks good to me, I tested it and had no problems.

Thanks for testing, Robert.

Maybe we should add the option to ShingleAnalyzerWrapper too?

Yes, I missed that one - I've added support for it in this version of the patch, along with one test in ShingleAnalyzerWrapperTest (for the single-token case).

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Committed: trunk revision 1006187, branch_3x revision 1006195

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1

apache / lucene

Add ShingleFilter option to output unigrams if no shingles can be generated [LUCENE-1370] #2444