Open asfimport opened 10 years ago
Nitzan Shaked (migrated from JIRA)
Updated patch, including tests
Nitzan Shaked (migrated from JIRA)
It did not. Added the obvious Locale.ROOT parameter. New patch (attaching in just a second) includes:
- .toUpperCase() with Locale.ROOT, as mentioned above
- a fix for an ant precommit issue: rogue tabs
It passes ant precommit, with the (possible?) exception of "documentation-lint", which I could not test.
Nitzan Shaked (migrated from JIRA)
Fixes for ant precommit
Nitzan Shaked (migrated from JIRA)
Any word on this?
Ahmet Arslan (@iorixxx) (migrated from JIRA)
I see that somehow your patch contains old versions of itself, which makes it hard to read. Can you create a patch against trunk? It would also be nice to have documentation describing the functionality of this filter, and why we cannot achieve it with existing analysis components.
Ahmet Arslan (@iorixxx) (migrated from JIRA)
What happens when this filter is instantiated with a minLen greater than maxLen?
Otis Gospodnetic (@otisg) (migrated from JIRA)
Didn't look at this, but I remember needing/writing something like this 10+ years ago... back then, though, I think I wanted the output to be something like: com, com.google, com.google.www - i.e. tokenized, but in reversed order.
Nitzan Shaked (migrated from JIRA)
Ahmet:
1) I'll attach a "squashed" version of the patch, without history; hopefully that'll be easier to read.
2) I don't know how to "prove" that something can't be done using existing analysis components, but after spending quite some time on this, and after asking on S.O., I am fairly convinced that it indeed cannot be done with existing components.
3) Instantiating with minLen > maxLen is OK, since maxLen can be negative (-2 to count 2 sub-tokens from the end, for example). It might also happen that minLen is greater than some tokens' lengths. In those cases there will simply be no output for the given token. I'll add a check that when both minLen and maxLen are positive, minLen <= maxLen.
Otis: while I'm adding this last check, I'll also add the "reverse" option; I can see why that might be useful.
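A tiny standalone illustration of the maxLen semantics described above (this is a hypothetical sketch, not code from the patch; the name `effectiveMax` is invented for illustration): 0 means unlimited, and a negative value means the token's sub-part count minus the absolute value.

```java
// Hypothetical sketch of the maxLen resolution described in the thread:
// 0 = unlimited, negative = number of sub-parts minus |maxLen|.
public class MaxLenDemo {
    static int effectiveMax(int numParts, int maxLen) {
        if (maxLen == 0) return numParts;          // unlimited
        if (maxLen < 0) return numParts + maxLen;  // count from the end
        return maxLen;                             // plain positive cap
    }

    public static void main(String[] args) {
        // "www.google.com" splits into 3 sub-parts; maxLen = -2 caps
        // sub-sequences at 3 - 2 = 1 sub-part.
        System.out.println(effectiveMax(3, -2)); // prints 1
        System.out.println(effectiveMax(3, 0));  // prints 3
    }
}
```

With minLen = 2 and an effective max of 1, no sub-sequence qualifies, which matches the "no output for the given token" behavior described above.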
Nitzan Shaked (migrated from JIRA)
Latest patch, all commits squashed into one
Nitzan Shaked (migrated from JIRA)
Ahmet, Otis: the latest patch I just attached contains everything mentioned above: a check for minLen > maxLen (when maxLen > 0), a squashed patch as per Ahmet's request, and a "reverse" feature (with added tests, obviously).
Ahmet Arslan (@iorixxx) (migrated from JIRA)
In one place the old header is used: `Copyright 2005 The Apache Software Foundation`
Koji Sekiguchi (@kojisekig) (migrated from JIRA)
> Didn't look at this, but I remember needing/writing something like this 10+ years ago.... but I think back then I wanted to have output be something like: com, com.google, com.google.www - i.e. tokenized, but reversed order.
PathHierarchyTokenizer can tokenize something like that.
Nitzan Shaked (migrated from JIRA)
Copied verbatim from files in the same dir. Where is the 'new' header? I'll replace it.
While we're at it: anything else?
Nitzan Shaked (migrated from JIRA)
Koji: it can't do what I'm trying to do. Have you looked at my description?
Koji Sekiguchi (@kojisekig) (migrated from JIRA)
> Koji: it can't do what I'm trying to do. Have you looked at my description?
Please ignore my comment, Nitzan, as it was just for what Otis described, and PathHierarchyTokenizer is a Tokenizer, not a TokenFilter. :)
Nitzan Shaked (migrated from JIRA)
Updated patch, contains "new format header" for the one place that used the "old format header"
Nitzan Shaked (migrated from JIRA)
So what's up with this?
A new configurable token filter which, given a token, breaks it into sub-parts and outputs consecutive sub-sequences of those sub-parts.
Useful, for example, during indexing to generate variations on domain names, so that "www.google.com" can be found by searching for "google.com" or "www.google.com".
Parameters:
- sepRegexp: A regular expression used to split incoming tokens into sub-parts.
- glue: A string used to concatenate sub-parts together when creating sub-sequences.
- minLen: Minimum length (in sub-parts) of output sub-sequences.
- maxLen: Maximum length (in sub-parts) of output sub-sequences (0 for unlimited; negative numbers for token length in sub-parts minus the specified length).
- anchor: Anchor.START to output only prefixes, Anchor.END to output only suffixes, or Anchor.NONE to output any sub-sequence.
- withOriginal: Whether to also output the original token.
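The generation logic described by these parameters can be sketched as plain Java (a hypothetical, self-contained illustration, not the actual TokenFilter from the patch; the class and method names are invented, and the anchor/withOriginal options are omitted for brevity):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the sub-sequence expansion described above:
// split the token on sepRegexp, then emit every consecutive run of
// sub-parts whose length lies in [minLen, maxLen], re-joined with glue.
public class SubSeqSketch {
    // maxLen semantics per the description: 0 = unlimited,
    // negative = number of sub-parts minus |maxLen|.
    static List<String> subSequences(String token, String sepRegexp,
                                     String glue, int minLen, int maxLen) {
        String[] parts = token.split(sepRegexp);
        int max = maxLen == 0 ? parts.length
                : maxLen < 0 ? parts.length + maxLen
                : maxLen;
        List<String> out = new ArrayList<>();
        for (int len = minLen; len <= Math.min(max, parts.length); len++) {
            for (int start = 0; start + len <= parts.length; start++) {
                out.add(String.join(glue,
                        Arrays.copyOfRange(parts, start, start + len)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "www.google.com" with minLen=2, maxLen=0 (unlimited):
        System.out.println(subSequences("www.google.com", "\\.", ".", 2, 0));
        // prints [www.google, google.com, www.google.com]
    }
}
```

Restricting the outer loop's start index to 0 would give the Anchor.START (prefix-only) behavior; pinning the run to end at the last sub-part would give Anchor.END.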
EDIT: now includes tests for the filter and for the factory.
Migrated from LUCENE-5674 by Nitzan Shaked, updated Jun 01 2014 Attachments: subseqfilter.patch