CombiningFilter to recombine tokens into a single token for sorting [LUCENE-3413]

asfimport commented 13 years ago

I whipped up this CombiningFilter for the following use case:

I've got a bunch of titles of e.g., Books, such as:

The Grapes of Wrath
Tommy Tommerson saves the World
Top of the World
The Tales of Beedle the Bard
Born Free

etc.

I want to sort these titles using a String field that includes stopword analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), etc. I created an analysis chain in Solr for this that was based off of alphaOnlySort, which looks like this:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string, 
             which may include back references to portions of the original
             string matched by the pattern.

             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.

             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        /> 
    </analyzer>       
    </fieldType>

The issue with alphaOnlySort is that it doesn't support stopword removal or synonyms, because those operate on individual tokens rather than on the full strings produced by the KeywordTokenizer (which does not do tokenization). I needed a filter that would allow me to change alphaOnlySort and its analysis chain from using KeywordTokenizer to using WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, take "The Grapes of Wrath". I needed a way for it to get turned into:

grapes of wrath

And then to combine those tokens into a single token:

grapesofwrath

The attached CombiningFilter takes care of that. It doesn't do it super efficiently I'm guessing (since I used a StringBuffer), but I'm open to suggestions on how to make it better.
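To make the target transformation concrete, here is a minimal plain-Java sketch of what the chain is meant to produce. It is not the attached patch and uses no Lucene or Solr classes; the class name SortKeySketch and the three-word stopword set are assumptions for illustration only (the real list comes from stopwords.txt).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of the intended chain; not Lucene API.
class SortKeySketch {
  // Assumed stopword list, for illustration only.
  static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

  static String sortKey(String title) {
    List<String> kept = new ArrayList<>();
    for (String raw : title.trim().split("\\s+")) {   // stand-in for WhitespaceTokenizer
      String token = raw.toLowerCase(Locale.ROOT);     // stand-in for LowerCaseFilter
      if (STOPWORDS.contains(token)) {
        continue;                                      // stand-in for StopFilter
      }
      token = token.replaceAll("[^a-z]", "");          // stand-in for PatternReplaceFilter
      if (!token.isEmpty()) {
        kept.add(token);
      }
    }
    return String.join("", kept);                      // the combining step
  }

  public static void main(String[] args) {
    System.out.println(sortKey("The Grapes of Wrath"));
  }
}
```

With this assumed stopword set, "The Grapes of Wrath" and "Grapes of Wrath" end up with the same sort key, which is the grouping behaviour I'm after.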

One other thing: this analyzer apparently works fine for analysis (i.e., it produces the desired tokens); however, when sorting in Solr I'm getting null sort tokens. I need to figure out why.

Here ya go!

Migrated from LUCENE-3413 by Chris A. Mattmann, updated Jan 09 2013 Attachments: LUCENE-3413.Mattmann.090311.patch.txt, LUCENE-3413.Mattmann.090511.patch.txt

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The problem with this implementation of the filter is that it consumes the underlying TokenStream in the constructor, concatenates everything, and then wraps a KeywordTokenizer.

The problem is that the TokenFilters are not fully initialized in the constructor.

The filter should do the merging directly inside incrementToken():

on the first call to incrementToken, it should do a while(input.incrementToken()) loop and collect all tokens into a buffer and at the end copy the buffer into the term attribute
on the second call, return false
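The two steps above can be sketched roughly as follows. This is only an illustration of the control flow: MiniStream, ListStream and CombiningStream are hypothetical stand-ins written against plain Java, not Lucene's TokenStream/TokenFilter classes, and unlike the description above the sketch returns false when the input had no tokens at all.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token stream; Lucene's real TokenStream exposes attributes instead.
interface MiniStream {
  boolean incrementToken();  // advance to the next token; false when exhausted
  String term();             // text of the current token
}

// Emits a fixed list of tokens, one per incrementToken() call.
class ListStream implements MiniStream {
  private final Iterator<String> it;
  private String current;

  ListStream(List<String> tokens) {
    this.it = tokens.iterator();
  }

  public boolean incrementToken() {
    if (!it.hasNext()) {
      return false;
    }
    current = it.next();
    return true;
  }

  public String term() {
    return current;
  }
}

// Does no work in the constructor; drains the input on the first incrementToken() call.
class CombiningStream implements MiniStream {
  private final MiniStream input;
  private boolean firstCall = true;
  private String combined = "";

  CombiningStream(MiniStream input) {
    this.input = input;  // only store the reference here
  }

  public boolean incrementToken() {
    if (!firstCall) {
      return false;                      // second call: nothing more to emit
    }
    firstCall = false;
    StringBuilder buffer = new StringBuilder();
    boolean sawToken = false;
    while (input.incrementToken()) {     // collect every upstream token
      sawToken = true;
      buffer.append(input.term());
    }
    combined = buffer.toString();        // "copy the buffer into the term"
    return sawToken;
  }

  public String term() {
    return combined;
  }
}
```

Wrapping a ListStream of "grapes", "of", "wrath" yields a single token "grapesofwrath" on the first call and false on the second.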

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

updated patch per Uwe's comments, and unit tests. Passes unit tests, but still fails in Solr ville to generate non-null sort strings.

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

Hey Uwe, thanks for the advice. I went ahead and updated the code (and attached a unit test). This patch, like my last one, passes the attached unit test. However, in Solr-land, when defining a customization of alphaOnlySort that uses the WhitespaceTokenizer (instead of the KeywordTokenizer), and then uses the CombiningFilter to merge the tokens at the end, analysis in solr's analysis.jsp looks fine, but I get null sort tokens (when I set fsv=true).

So, long story short, after I made the updates you suggested, I still get null sort keys. Any ideas?

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

For reference, here is the fieldType definition from Solr-ville that I am using:

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>        
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string, 
             which may include back references to portions of the original
             string matched by the pattern.

             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.

             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
          -->          
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />

        <filter class="org.apache.solr.analysis.CombiningFilterFactory"/>
      </analyzer>
    </fieldType>

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

Hmmm, maybe #reset is getting called somewhere. I wrote another unit test to call reset and then test calling incrementToken again. As it turns out, it fails, because calling input.reset in CombiningFilter calls e.g., LowerCaseFilter.reset, which in turn calls KeywordTokenizer.reset. The call to KeywordTokenizer.reset does nothing, and it just uses the stub method in TokenStream, even though KeywordTokenizer has a method #reset that takes a Reader input.

I wonder if the lack of having a working reset method is messing stuff up. What tells me that's probably wrong though is that LowerCaseFilter just uses the default parent class #reset (which just calls its input.reset), so I don't think that's an issue. Sigh.

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

ooops I meant WhitespaceTokenizer.reset, not KeywordTokenizer.reset. Sorry.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

From a quick look at this:

What version of Solr is this against?
I believe your problem is that CombiningFilter is not resetting its firstCall variable. Therefore when the TokenFilter is reused, firstCall is always false and therefore incrementToken returns false (so nothing is ever emitted)

Add:

@Override
public void reset() throws IOException {
  super.reset();
  // allow the filter to emit again when the stream is reused
  this.firstCall = true;
}
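To illustrate why the missing flag reset would show up as empty output on reuse, here is a small plain-Java simulation. ReusableSource, ReusableCombiner and ResetDemo are hypothetical names, and the resetsFlag switch exists only to compare the two behaviours side by side; none of this is Lucene API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a tokenizer that is reused across documents.
class ReusableSource {
  private List<String> tokens = Collections.emptyList();
  private int pos = 0;

  void setTokens(List<String> tokens) {
    this.tokens = tokens;
    this.pos = 0;
  }

  // Returns the next token, or null when the current document is exhausted.
  String next() {
    return pos < tokens.size() ? tokens.get(pos++) : null;
  }
}

// Combines all tokens of the current document into one string, once per reset.
class ReusableCombiner {
  private final ReusableSource input;
  private final boolean resetsFlag;  // false reproduces the reported behaviour
  private boolean firstCall = true;

  ReusableCombiner(ReusableSource input, boolean resetsFlag) {
    this.input = input;
    this.resetsFlag = resetsFlag;
  }

  void reset() {
    if (resetsFlag) {
      firstCall = true;  // the suggested fix
    }
  }

  // Returns the combined token, or null when nothing is emitted.
  String nextCombined() {
    if (!firstCall) {
      return null;
    }
    firstCall = false;
    StringBuilder sb = new StringBuilder();
    for (String t = input.next(); t != null; t = input.next()) {
      sb.append(t);
    }
    return sb.toString();
  }
}

class ResetDemo {
  // Runs two documents through one combiner instance and returns the second sort key.
  static String secondKey(boolean resetsFlag) {
    ReusableSource source = new ReusableSource();
    ReusableCombiner combiner = new ReusableCombiner(source, resetsFlag);
    source.setTokens(Arrays.asList("grapes", "of", "wrath"));
    combiner.reset();
    combiner.nextCombined();
    source.setTokens(Arrays.asList("born", "free"));
    combiner.reset();
    return combiner.nextCombined();
  }
}
```

In this simulation, ResetDemo.secondKey(false) returns null for the second document, while ResetDemo.secondKey(true) returns "bornfree".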

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

why do you use the firstCall member at all? I mean you can just do:

final StringBuilder builder = new StringBuilder();
boolean returnVal = false;
while(input.incrementToken()) {
  returnVal = true;
  builder.append(ta.term());
}
ta.setTermBuffer(builder.toString());
return returnVal;

and you don't need a reset call.

StringBuffer btw. is almost never a good choice. Rather use StringBuilder

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

Thanks guys! Your updates fixed it! It's not sorting correctly!

I'll prepare two patches. One for Lucene that implements your suggestions. And another for Solr (containing the super trivial factory to instantiate this).

Thanks, again!

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

errr, I meant now instead of not. IOW, It's now sorting correctly. Thanks guys!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I'll prepare two patches. One for Lucene that implements your suggestions. And another for Solr (containing the super trivial factory to instantiate this).

you can do it in one patch :)

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

updated patch addressing comments from Simon. Chris Male suggested renaming it, but I couldn't come up with a better name. Maybe we could call it CombiningTokenFilter, or something for specificity, but I'll leave that part up to you guys.

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

updated patch fix package names. This patch applies against the latest trunk.

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

final updated patch

asfimport commented 13 years ago

Chris A. Mattmann (migrated from JIRA)

BTW, I couldn't get it to work by removing the firstCall variable using Simon's suggestion, so I left it in there. If you guys want to figure it out, go for it, but the patch I attached right now is working...thanks!

asfimport commented 11 years ago

Chris A. Mattmann (migrated from JIRA)

Hi Guys, there seems to be some interest on list for such a capability: http://lucene.472066.n3.nabble.com/Which-token-filter-can-combine-2-terms-into-1-td4028482.html (or at least sounds similar). Any interest from someone to work with me to commit this?

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

For sorting, would you want 'grapes_of_wrath'? This distinguishes the word 'grapes' from words that might start with 'grapes'. (I don't know of any, but you see the problem :)

Also, in this use case numerical canonicalization makes sense for searching and sorting. Twenty-two -> 22, and also 'twenty two' -> 22. Or maybe 'twenty two' -> 'twenty-two'.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A few comments:

TestCombiningFilter should extend BaseTokenStreamTestCase, build an Analyzer with MockTokenizer+this filter and use BaseTokenStreamTestCase asserts (see http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/TestJapaneseKatakanaStemFilter.java?view=markup as a good example of an analysis unit test).
@author tags should be removed
indentation should be 2 spaces not tabs.
instead of throwing away the return value of addAttribute(TermAttribute.class) in the ctor, just initialize this as an instance variable:

public class CombiningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

This way you don't have to constantly look it up from the attribute map for each token, instead you just access "termAtt".

once the code is updated to CharTermAttribute, the various string creations can be eliminated, since it implements Appendable and CharSequence. so instead of

builder.append(ta.term());

just do:

builder.append(termAtt);

and same at the end, instead of

ta.setTermBuffer(builder.toString());

just do:

termAtt.setEmpty().append(builder);

in reset(), I would just call super.reset() instead of "this.input.reset()". This is a little cleaner and accomplishes the same thing (it's how the other TokenFilters do this).
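As a rough illustration of the Appendable/CharSequence point, the sketch below uses a hypothetical TermBuffer class as a stand-in for the term attribute. It is not Lucene's CharTermAttribute; it only mimics the setEmpty()/append() shape quoted above so the example runs with plain Java.

```java
// Hypothetical stand-in for a term attribute that is both a CharSequence and an Appendable.
class TermBuffer implements CharSequence, Appendable {
  private final StringBuilder chars = new StringBuilder();

  TermBuffer setEmpty() {
    chars.setLength(0);
    return this;
  }

  @Override
  public TermBuffer append(CharSequence csq) {
    chars.append(csq);
    return this;
  }

  @Override
  public TermBuffer append(CharSequence csq, int start, int end) {
    chars.append(csq, start, end);
    return this;
  }

  @Override
  public TermBuffer append(char c) {
    chars.append(c);
    return this;
  }

  @Override
  public int length() {
    return chars.length();
  }

  @Override
  public char charAt(int index) {
    return chars.charAt(index);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    return chars.subSequence(start, end);
  }

  @Override
  public String toString() {
    return chars.toString();
  }
}

class AppendableDemo {
  static String combine(String... tokens) {
    TermBuffer termAtt = new TermBuffer();
    StringBuilder builder = new StringBuilder();
    for (String token : tokens) {
      termAtt.setEmpty().append(token);  // simulate the upstream filter setting the current term
      builder.append(termAtt);           // CharSequence overload; no intermediate String
    }
    termAtt.setEmpty().append(builder);  // copy the combined text back, again without toString()
    return termAtt.toString();
  }
}
```

The only String created here is the final toString() used to return the result from the demo method.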

asfimport commented 11 years ago

Chris A. Mattmann (migrated from JIRA)

Thanks for the comments Robert. I'll take a pass at updating the patch per your comments. Lance, I think I get what you're saying. This is now in production at a fairly large company that I was doing consulting for and is working fine for their titles, etc, so I think it's still pretty useful.

asfimport commented 11 years ago

Alexandre Rafalovitch (@arafalov) (migrated from JIRA)

Any chance this filter could take an optional 'connector' parameter to put between tokens when joining them?

That way one could use '_' for sorting and (my need) a ' ' for recreating original string after stripping some token types.
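A connector option along these lines could look roughly like the sketch below. The class and parameter names (ConnectorSketch, connector) are hypothetical, and this is plain Java over a list of already-analyzed tokens rather than a change to the attached filter.

```java
import java.util.List;

// Illustrative only: joins already-analyzed tokens with an optional connector.
class ConnectorSketch {
  // A null connector is treated as "", which keeps the current no-separator behaviour.
  static String combine(List<String> tokens, String connector) {
    return String.join(connector == null ? "" : connector, tokens);
  }
}
```

With the tokens "grapes", "of", "wrath", a connector of "_" gives "grapes_of_wrath", " " gives "grapes of wrath", and null gives "grapesofwrath".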

asfimport commented 11 years ago

Chris A. Mattmann (migrated from JIRA)

Hey Alexandre happy to try and code it up if you find it useful. Still working on the update for Robert's review.

CombiningFilter to recombine tokens into a single token for sorting [LUCENE-3413] #4486