apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

A new token filter: SubSequence [LUCENE-5674] #6736

Open asfimport opened 10 years ago

asfimport commented 10 years ago

A new configurable token filter which, given a token, breaks it into sub-parts and outputs consecutive sub-sequences of those sub-parts.

Useful, for example, during indexing to generate variations on domain names, so that "www.google.com" can be found by searching for either "google.com" or "www.google.com".

Parameters:

sepRegexp: A regular expression used to split incoming tokens into sub-parts.
glue: A string used to concatenate sub-parts together when creating sub-sequences.
minLen: Minimum length (in sub-parts) of output sub-sequences.
maxLen: Maximum length (in sub-parts) of output sub-sequences (0 for unlimited; a negative number means the token's length in sub-parts minus the specified length).
anchor: Anchor.START to output only prefixes, Anchor.END to output only suffixes, or Anchor.NONE to output any sub-sequence.
withOriginal: Whether to also output the original token.
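To make the parameters concrete, here is a minimal standalone sketch of the sub-sequence generation in plain Java. It is not the code from the attached patch and does not use the Lucene TokenFilter API; the class name `SubSequenceSketch`, the output order, and the choice to suppress a duplicate of the original token when `withOriginal` is set are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the filter from subseqfilter.patch.
class SubSequenceSketch {
    /** Mirrors the Anchor values named in the description. */
    enum Anchor { START, END, NONE }

    static List<String> subSequences(String token, String sepRegexp, String glue,
                                     int minLen, int maxLen, Anchor anchor,
                                     boolean withOriginal) {
        List<String> parts = Arrays.asList(Pattern.compile(sepRegexp).split(token));
        int n = parts.size();
        // maxLen == 0 means unlimited; a negative maxLen is relative to the part count.
        int effectiveMax = maxLen > 0 ? Math.min(maxLen, n) : n + maxLen;
        int effectiveMin = Math.max(minLen, 1);

        List<String> out = new ArrayList<>();
        if (withOriginal) {
            out.add(token);
        }
        for (int start = 0; start < n; start++) {
            if (anchor == Anchor.START && start != 0) {
                break; // prefixes only: every sub-sequence starts at the first part
            }
            for (int len = effectiveMin; len <= effectiveMax && start + len <= n; len++) {
                if (anchor == Anchor.END && start + len != n) {
                    continue; // suffixes only: every sub-sequence ends at the last part
                }
                String joined = String.join(glue, parts.subList(start, start + len));
                // Assumption: do not emit the original token a second time.
                if (!(withOriginal && joined.equals(token))) {
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Suffixes of at least two parts, as in the domain-name use case.
        System.out.println(subSequences("www.google.com", "\\.", ".", 2, 0, Anchor.END, false));
        // prints [www.google.com, google.com]
    }
}
```

With `Anchor.END` and `minLen` of 2, the bare "com" suffix is not produced, which matches the stated goal of finding "www.google.com" via "google.com".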

EDIT: now includes tests for filter and for factory.


Migrated from LUCENE-5674 by Nitzan Shaked, updated Jun 01 2014 Attachments: subseqfilter.patch

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Updated patch, including tests

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

anchorStr.toUpperCase()

does this pass ant precommit ?

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

It did not. Added the obvious Locale.ROOT parameter. New patch (attaching in just a second) includes:

  1. Fix for .toUpperCase() as mentioned above
  2. Fix for another ant precommit issue: rogue tabs
  3. A fix for the RandomChains test (did not allow SubSeqFilter.Anchor as a parameter type)
  4. Passes all tests
  5. Passes ant precommit, with the (possible?) exception of "documentation-lint", which I could not test
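
The `Locale.ROOT` fix in item 1 can be sketched as follows. This is an illustration, not the patch's code: the class `AnchorParsing` and the method `parseAnchor` are hypothetical names, and the enum simply mirrors the anchor values from the issue description. The point is that `toUpperCase()` without an argument depends on the JVM's default locale, which is the kind of call a precommit check would flag.

```java
import java.util.Locale;

// Hypothetical illustration of locale-independent upper-casing.
class AnchorParsing {
    enum Anchor { START, END, NONE }

    /** Parses a configuration string such as "start" or "None" into an Anchor. */
    static Anchor parseAnchor(String anchorStr) {
        // Locale.ROOT keeps the case mapping independent of the default locale.
        return Anchor.valueOf(anchorStr.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(parseAnchor("end")); // prints END

        // Why the default locale matters: under Turkish rules, a lowercase "i"
        // upper-cases to the dotted capital U+0130 rather than "I".
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("i".toUpperCase(turkish).equals("\u0130")); // prints true
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));  // prints true
    }
}
```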

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Fixes for ant precommit

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Any word on this?

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

I see that your patch somehow contains old versions of itself, which makes it hard to read. Can you create a patch against trunk? It would also be nice to have documentation describing the functionality of this filter, and why we cannot achieve it with existing analysis components.

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

What happens when this filter is instantiated with a minLen greater than maxLen?

asfimport commented 10 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Didn't look at this, but I remember needing/writing something like this 10+ years ago.... but I think back then I wanted to have output be something like: com, com.google, com.google.www - i.e. tokenized, but reversed order.
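
The reversed output described here (com, com.google, com.google.www) can be sketched in plain Java as prefixes of the reversed part list. This is only an illustration of the idea; the class and method names are hypothetical and nothing here is taken from the patch or from any existing Lucene component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of "tokenized, but reversed order".
class ReversedPrefixes {
    static List<String> reversedPrefixes(String token, String sepRegexp, String glue) {
        // Arrays.asList is backed by the array, so reversing in place is allowed.
        List<String> parts = Arrays.asList(token.split(sepRegexp));
        Collections.reverse(parts);

        List<String> out = new ArrayList<>();
        for (int len = 1; len <= parts.size(); len++) {
            out.add(String.join(glue, parts.subList(0, len)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(reversedPrefixes("www.google.com", "\\.", "."));
        // prints [com, com.google, com.google.www]
    }
}
```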

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Ahmet:

1) I'll attach a "squashed" version of the patch, without history, hopefully that'll be easier to read.
2) I don't know how to "prove" that something can't be done using existing analysis components, but after spending quite some time on this, and after asking on S.O., I am fairly convinced that it indeed cannot be done using existing components.
3) Instantiating with minLen>maxLen is ok, since maxLen can be negative (-2 to count 2 sub-tokens from the end, for example). It might also happen that minLen may be greater than some tokens' lengths. In those cases there will simply be no output for the given token. I'll add a check that when both minLen and maxLen are positive, then minLen <= maxLen.
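
The length rules in point 3 can be sketched as two small helpers. This is an assumption about how such a check might look, not the code in the patch; `LengthBounds`, `checkBounds`, and `effectiveMax` are hypothetical names.

```java
// Hypothetical illustration of the minLen/maxLen rules discussed above.
class LengthBounds {
    /** Rejects minLen > maxLen only when maxLen is positive (an absolute limit). */
    static void checkBounds(int minLen, int maxLen) {
        if (maxLen > 0 && minLen > maxLen) {
            throw new IllegalArgumentException(
                "minLen (" + minLen + ") must not exceed maxLen (" + maxLen + ")");
        }
    }

    /** Resolves maxLen for a token with partCount sub-parts. */
    static int effectiveMax(int maxLen, int partCount) {
        // 0 means unlimited; a negative value counts back from the part count.
        return maxLen > 0 ? Math.min(maxLen, partCount) : partCount + maxLen;
    }

    public static void main(String[] args) {
        checkBounds(3, -2); // accepted: a negative maxLen is relative, not absolute
        System.out.println(effectiveMax(-2, 5)); // prints 3
        System.out.println(effectiveMax(0, 5));  // prints 5
        // With minLen = 4 and maxLen = -2, a 5-part token has an effective
        // maximum of 3, so no length satisfies both bounds and nothing is output.
    }
}
```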

Otis: while I'm adding this last check, I'll also add the "reverse" option, I can see why that might be useful.

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Latest patch, all commits squashed into one

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Ahmet, Otis: the latest patch I just added contains everything mentioned above: a check for minLen > maxLen (if maxLen > 0), a squashed patch as per Ahmet's request, and a "reverse" feature (obviously with added tests).

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

In one place, is the old header used?

```
Copyright 2005 The Apache Software Foundation
```

asfimport commented 10 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

> Didn't look at this, but I remember needing/writing something like this 10+ years ago.... but I think back then I wanted to have output be something like: com, com.google, com.google.www - i.e. tokenized, but reversed order.

PathHierarchyTokenizer can tokenize something like that.

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Copied verbatim from files in the same dir. Where is the 'new' header? I'll replace it.

While we're at it: anything else?

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Koji: it can't do what I'm trying to do. Have you looked at my description?

asfimport commented 10 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

> Koji: it can't do what I'm trying to do. Have you looked at my description?

Please ignore my comment Nitzan as it was just for what Otis described, and PathHierarchyTokenizer is a Tokenizer, not TokenFilter. :)

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

Updated patch, contains "new format header" for the one place that used the "old format header"

asfimport commented 10 years ago

Nitzan Shaked (migrated from JIRA)

So what's up with this?