
Deprecate/remove language-specific tokenizers in favor of StandardTokenizer [LUCENE-2747] #3821

Closed: asfimport closed this issue 13 years ago

asfimport commented 13 years ago

As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer; these language-specific tokenizers should be deprecated in 3.1 and removed in 4.0. The language-specific analyzers, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1.

Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer.

For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner, U+200C), but under the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter that converts ZWNJ to spaces ahead of StandardTokenizer in the converted PersianAnalyzer.
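A minimal sketch of that suggestion (hypothetical class, plain java.io rather than Lucene's CharFilter API, which additionally supports offset correction): map U+200C to a space before the tokenizer sees the text. Note that java.io.FilterReader delegates the single-char read() straight to the wrapped Reader, so it has to be overridden here too.

    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;

    // Hypothetical sketch: replaces ZWNJ (U+200C) with a space so a
    // downstream UAX#29 tokenizer splits on it. A one-to-one char mapping
    // leaves offsets unchanged, so no offset correction is needed.
    public final class ZwnjToSpaceReader extends FilterReader {
      public ZwnjToSpaceReader(Reader in) { super(in); }

      @Override
      public int read() throws IOException {
        int c = in.read();
        return c == '\u200C' ? ' ' : c;
      }

      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
          if (cbuf[i] == '\u200C') cbuf[i] = ' ';
        }
        return n;
      }
    }

(Lucene's eventual PersianCharFilter does essentially this inside the CharFilter framework.)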


Migrated from LUCENE-2747 by Steven Rowe (@sarowe), resolved Dec 07 2010. Attachments: LUCENE-2747.patch (versions: 2)

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's a quick stab at a patch.

I had to add at least minimal support to ReusableAnalyzerBase in case you want CharFilters, since it doesn't have any today.

Maybe there is a better way to do it, though.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

CharFilter must at least also implement read() to read one char. Most tokenizers don't need it, but without it the class is incomplete. It does not need to implement read(char[]) without offset and count.

Otherwise, the reusable part looks fine for now.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

CharFilter must at least also implement read() to read one char.

That's incorrect: only read(char[] cbuf, int off, int len) is abstract in Reader. CharStream extends Reader but only adds correctOffset(); CharFilter extends CharStream but only delegates read(char[] cbuf, int off, int len).

So implementing read() would only add useless code duplication here.
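To make the point concrete, here is a minimal (hypothetical) Reader subclass: java.io.Reader declares only read(char[], int, int) and close() abstract, and its read() is already implemented on top of the bulk method, so single-char reads work with no extra code.

    import java.io.IOException;
    import java.io.Reader;

    // Hypothetical sketch: only the two abstract Reader methods are
    // implemented; Reader.read() still works because its default
    // implementation calls read(char[], int, int) with a 1-char buffer.
    final class UpperCasingReader extends Reader {
      private final Reader in;
      UpperCasingReader(Reader in) { this.in = in; }

      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
          cbuf[i] = Character.toUpperCase(cbuf[i]);
        }
        return n;
      }

      @Override
      public void close() throws IOException { in.close(); }
    }

(java.io.FilterReader, by contrast, delegates read() directly to the wrapped Reader, which is where Uwe's concern would apply.)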

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Ay, ay "Code Dup Policeman".

From perf standpoint for real FilterReaders in java.io that would be no-go, but here it's fine as Tokenizers always buffer. Also java.io's FilterReader are different and delegate this method, but not CharFilter.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Wait, that's an interesting point: is there any advantage to actually using "real FilterReaders" for this API?

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too.

Then I'd be happy.

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Robert, your patch looks good - I have a couple of questions:

You removed TestHindiFilters.testTokenizer(), TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they?

You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so that they can be used with Lucene 4.0+ when the supplied Version is less than 3.1 - good catch, I had forgotten this requirement - but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT?

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too.

UAX29Tokenizer could be combined with LowerCaseFilter to provide that, no?

Robert is arguing in the reopened #3243 for StandardTokenizer to be stripped down so that it only implements the UAX#29 rules (i.e., dropping URL+email recognition). If that comes to pass, StandardAnalyzer would just be UAX#29 + lowercase + stopwords (with English stopwords by default, but those can be overridden in the ctor). Would that make you happy?
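For illustration, here is roughly what that composition looks like as an Analyzer (a sketch against the modern Analyzer API; in 3.1 the class names, package locations, and Version arguments differ):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch: UAX#29 tokenization + lowercasing + stopwords, i.e. the
    // stripped-down StandardAnalyzer described above.
    public final class Uax29SimpleAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer src = new StandardTokenizer();  // UAX#29 word break rules
        TokenStream tok = new LowerCaseFilter(src);
        tok = new StopFilter(tok, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
      }
    }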

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You removed TestHindiFilters.testTokenizer(), TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they?

Oh, I just deleted everything associated with that tokenizer...

You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so that they can be used with Lucene 4.0+ when the supplied Version is less than 3.1 - good catch, I had forgotten this requirement - but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT?

I removed the IndicTokenizer (unreleased) and deleted its tests, but I kept and deprecated the Arabic one, since we have released it.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I looked at the patch briefly, and the charStream(Reader) extension looks good to me, though I would make it protected and have it throw an IOException. Since this API is public and folks will use it in the wild, we need to make sure we don't have to add the exception later, or people creating Readers will have to play tricks just because the interface has no IOException. About making it protected: do we need to call it in a non-protected context? Maybe I'm missing something...

    public Reader charStream(Reader reader) {
      return reader;
    }

    // should be

    protected Reader charStream(Reader reader) throws IOException {
      return reader;
    }

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon: I agree with both of those points... we should change the method signature.

Also, I called it charStream (this is what Solr's analyzer calls it), but this is slightly confusing since the API is all Reader-based. Alternatively we could give it a different name, wrapReader or something... not sure, I didn't have any better ideas than charStream.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

...Alternatively we could give it a different name, wrapReader or something... not sure, I didn't have any better ideas than charStream.

wrapReader seems too specific; what about initReader?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer.

DM, can you elaborate here?

Are you speaking of the existing StandardAnalyzer in previous releases, which doesn't properly deal with tokenizing diacritics, etc.? That is the reason these "special" tokenizers exist: to work around those bugs. But StandardTokenizer now handles this stuff fine, and they are obsolete.

I'm confused, though, about how SimpleAnalyzer would ever be any better in previous releases, since it would barf on these diacritics too; it only emits tokens that are runs of Character.isLetter characters.

Or is there something else I'm missing here?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's an updated patch.

In reality the previous patch was a problem: because initReader() was in the TokenStream components, it caused code duplication in any Analyzer, which had to specify its CharFilter twice: once in createComponents for the initial Reader, and once in the TokenStreamComponents implementation for reset(Reader).

So I moved this to just be a method of ReusableAnalyzerBase.

Also, I didn't apply the 'throws IOException'. After rethinking it, there is no need: none of our CharFilters, for example, throw IOException in their constructors, and even the Analyzer.tokenStream method cannot throw IOException.

We shouldn't add 'throws X' just because some arbitrary user class MIGHT throw it; they might throw SQLException or InvalidMidiDataException too.
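A sketch of the resulting pattern (shown against the modern Analyzer API, where this hook eventually settled as initReader(String, Reader)): the CharFilter is declared exactly once and is applied both on the initial Reader and on every reuse.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.fa.PersianCharFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch: the ZWNJ handling from the issue description, expressed once
    // via the initReader hook instead of twice in the components.
    public final class ZwnjAwareAnalyzer extends Analyzer {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new PersianCharFilter(reader);  // maps ZWNJ to space
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new StandardTokenizer());
      }
    }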

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code rather than the JIRA comments alone. The old StandardAnalyzer had a kitchen-sink approach to tokenization, trying to do too much with modern constructs, e.g. URLs, email addresses, acronyms.... It and SimpleAnalyzer would produce about the same stream on "old" English and some other texts, but StandardAnalyzer was much slower. (I don't remember how much slower, but it was obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I could take the language-specific tokenizers, lists of stop words, and stemmers, and create variations of SimpleAnalyzer that properly handled a particular language. (I created my own analyzers because I wanted to make stop words and stemming optional.)

In looking at the code in trunk (I should have done that before making my comment), I see that UAX29Tokenizer is duplicated in StandardAnalyzer's jflex grammar and that ClassicAnalyzer uses the old jflex grammar. Also, the new StandardAnalyzer does a lot less.

If I understand the suggestion of this and the other two issues, StandardAnalyzer will no longer handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29, doing little else. Thus my confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

And if I understand where this is going: Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer in StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like SimpleAnalyzer and "adapt" it to a given language based on things like Unicode properties. In fact that's exactly what we did in the cases here (Arabic, Persian, Hindi, etc.).

But now we can actually tokenize "correctly" for more languages with jflex, thanks to its improved Unicode support, and it's superior to these previous hacks :)

To try to answer some of your questions (all my opinion):

Is there a point to having SimpleAnalyzer?

I guess so; a lot of people can use it if they have English-only content and are probably happy with discarding numbers, etc. It's not a big loss to me if it goes, though.

Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or solr for analyzer components any more; they are all combined into modules/analysis. In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there, UAX29Tokenizer is in fact in Lucene core.

Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer in StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?

Personally, I would prefer if we move towards a factory model where these supplied "language analyzers" are actually xml/json/properties snippets. In other words, they are just example configurations that build your analyzer, like Solr does. This is nice because then you don't have to write code to customize how your analyzer works.

I think we have been making slow steps towards this, just doing basic things like moving stopword lists to .txt files. But I think the next step would be #3584, where we already have factories/config attribute parsers written for all these analysis components.

Then we could have support for declarative analyzer specification via xml/json/.properties/whatever, and move all these Analyzers to that. I still think you should be able to code up your own analyzer, but in my opinion the declarative route is much easier and preferable for the ones we supply.

Also, I think this would solve a lot of analyzer backwards-compatibility problems, because then our supplied analyzers are really just example configuration files, and we can change our examples however we want... someone can use their old config file (and hopefully their old analysis module jar file!) to guarantee the exact same behavior if they want.
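For the record, modern Lucene's CustomAnalyzer (added much later, in 5.0) realizes essentially this factory model; a declarative xml/json snippet would reduce to something like the following sketch, where components are looked up by factory name:

    import java.io.IOException;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    // Sketch using modern Lucene's CustomAnalyzer: components are resolved
    // by SPI factory name, the same names Solr configuration uses.
    public final class DeclarativeAnalyzerDemo {
      public static CustomAnalyzer build() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("standard")    // StandardTokenizerFactory
            .addTokenFilter("lowercase")  // LowerCaseFilterFactory
            .addTokenFilter("stop")       // StopFilterFactory, default stop set
            .build();
      }
    }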

Finally, most of the benefits of ICUTokenizer are actually in the UAX#29 support... the tokenizers are pretty close, with some minor differences:

  • the ICU one lets you customize the tokenization rules per script
  • the ICU one does dictionary-based word breaking for complex scripts such as Thai (jflex doesn't have this... yet)
  • the ICU one handles the full range of Unicode code points, including supplementary characters outside the BMP

In my opinion the last 2 points will probably be resolved eventually... I could see our ICUTokenizer possibly becoming obsolete down the road via better jflex support, though it would probably have to have hooks into ICU for the complex script support (so we get it for free from ICU).

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

Robert, I think we are on the same wavelength. Thanks.

I like the idea of declarative analyzers, too.

Regarding the "last 2 points" has anyone given input to the JFlex team on these needs?

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

DM, I'm a committer on the JFlex team. About the second-to-last point: when Robert said "(jflex doesn't have this... yet)" he meant the jflex-based implementation, not JFlex itself.

About the last point, JFlex is shooting for level 1 compliance with UTS#18 Unicode Regular Expressions, which requires conforming implementations to "handle the full range of Unicode code points, including values from U+10000 to U+10FFFF."

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or solr for analyzer components any more; they are all combined into modules/analysis. In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there, UAX29Tokenizer is in fact in Lucene core.

I meant o.a.l.analysis.core. I'd expect the premier analyzers to be in core.

Is there a point to having SimpleAnalyzer?

I guess so; a lot of people can use it if they have English-only content and are probably happy with discarding numbers, etc. It's not a big loss to me if it goes, though.

I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same as StandardAnalyzer with the addition of a Filter that pitches tokens that are not needed? With the suggestion in #3243 to use UAX29Tokenizer for StandardAnalyzer, effectively deprecating EMAIL and URL and possibly adding some kind of PUNCTUATION type (so that URLs/emails/acronyms... can be reconstructed, if someone desires), StandardAnalyzer is about as simple as one can get while still properly handling non-English/non-Western languages. It would just create the ALPHANUM, NUM and PUNCTUATION (if added) types that SimpleAnalyzer does not care about.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I meant o.a.l.analysis.core. I'd expect the premier analyzers to be in core.

I guess the package doesn't make a big difference to me; all the analyzers are in one place and treated the same. It's true we mixed the "core" analysis stuff with contrib and Solr, but if there are warts in the contrib stuff, we should be able to clean them up (I think this has been happening).

I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same as StandardAnalyzer with the addition of a Filter that pitches tokens that are not needed?

I don't think so; this seems like a trap. Lots of people use StandardTokenizer without any filter... I don't think it should emit trash (punctuation). If you want to do the other stuff, you can use ClassicTokenizer, or even WhitespaceTokenizer, and filter to your heart's content.

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

Robert, let me ask another way. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

DM, ALPHANUM is the only "word" token type - there is no ALPHA or WORD type - from the UAX29Tokenizer javadocs:

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character

So I guess you think that StandardAnalyzer should exclude tokens that have numeric characters in them?
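If someone did want that behavior without a new grammar, a post-tokenization filter would be enough. A sketch (modern API): drop <NUM>-typed tokens after the fact. Note this removes only pure numbers; a token like "abc123" is typed <ALPHANUM> and survives, so it does not fully answer Steven's question about tokens that merely contain digits.

    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch: keep the UAX#29 grammar as-is and discard <NUM> tokens
    // afterwards, instead of changing the tokenizer itself.
    public final class NoNumbersUax29Analyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer src = new StandardTokenizer();
        TokenStream tok = new TypeTokenFilter(src, Set.of("<NUM>"));
        return new TokenStreamComponents(src, tok);
      }
    }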

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Robert, Let me ask another way. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

I think, as I already commented, that it would be best if StandardTokenizer just implemented the UAX#29 algorithm. It's a great default and I don't think we should try to alter it.

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

Robert/Steven, I'm sorry. I fat-fingered the last post. I really need to take more care. s/Standard/Simple/;

That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer that emitted WORD instead of NUM or ALPHANUM, it would produce the same type of token stream, just alpha words.

Regarding compatibility: I think the results for English would be nearly, if not completely, identical. Western European languages would be only slightly off from identical. But for other languages it would be an improvement.

At this point, I'm content with what you guys are doing with non-English texts. Great job.

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer that emitted WORD instead of NUM or ALPHANUM, it would produce the same type of token stream, just alpha words.

DM, if you want, I could help you convert UAX29Tokenizer in this way - I hang out on IRC #lucene and #lucene-dev a lot of the time. I think you'd want to call it something other than SimpleAnalyzer, though - maybe UAX29AlphabeticTokenizer?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer that emitted WORD instead of NUM or ALPHANUM, it would produce the same type of token stream, just alpha words.

OK, now I understand you, and yes, I agree... My question is: should we even bother fixing it? Would anyone who actually cares about Unicode really want only some hacked subset of UAX#29?

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are all really bad for Unicode text in different ways, though Simple/Stop are bigger offenders, I think, because they will separate a base character from its combining characters (in my opinion this should always be avoided) and, worse, they will break tokens on them.
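A quick (hypothetical) demo of why that happens: a combining mark is not a letter to Character.isLetter, so a LetterTokenizer-style loop breaks decomposed text such as 'e' followed by U+0301 in the middle of what a reader sees as one character.

    // Hypothetical demo class, java.lang only.
    public final class CombiningMarkDemo {
      public static void main(String[] args) {
        System.out.println(Character.isLetter('e'));       // true
        System.out.println(Character.isLetter('\u0301'));  // false: token break here
        // U+0301 COMBINING ACUTE ACCENT is a non-spacing mark (Mn):
        System.out.println(Character.getType('\u0301') == Character.NON_SPACING_MARK);
      }
    }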

But people using them are probably happy? E.g., you can do like Solr: use WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, and it's mostly OK, depending upon options, except for cases like CJK, where it's a death trap.

Personally, I just don't use these things since I know the problems, but we could document "this is simplistic and won't work well for many languages" and keep them around for people that don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the .core package, but to me the package is meaningless; I was just trying to keep the analyzers arranged in some kind of order (e.g., pattern-based analysis in the .pattern package, etc.).

We could just as well call the package .basic or .simple or something else; it's just a name.

asfimport commented 13 years ago

DM Smith (migrated from JIRA)

Robert, I think

You mention a few broken Analyzers (and by implication related tokenizers and filters). I've a question about LowerCaseFilter: Isn't it bad as well?
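DM's instinct is a fair one: case conversion is language- and context-sensitive, which a per-character filter cannot fully capture. A small (hypothetical) demo of two classic cases, using only java.lang/java.util:

    import java.util.Locale;

    // Hypothetical demo: lowercasing depends on language and context.
    public final class LowerCasePitfalls {
      public static void main(String[] args) {
        // Turkish: uppercase 'I' lowercases to dotless i (U+0131), not 'i'.
        System.out.println("I".toLowerCase(Locale.ROOT));            // i
        System.out.println("I".toLowerCase(new Locale("tr", "TR"))); // ı
        // Greek: capital sigma lowercases to its final form 'ς' at word end.
        System.out.println("ΟΣ".toLowerCase(Locale.ROOT));           // ος
      }
    }

This is why Lucene ships language-specific case filters such as TurkishLowerCaseFilter and GreekLowerCaseFilter alongside the generic LowerCaseFilter.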

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK, I'd like to move forward with this if there are no objections. It improves these analyzers' multilingual capabilities, allowing them to support numbers etc., and deprecates/removes useless custom code (only unreleased functionality is removed).

We also add CharFilter support to ReusableAnalyzerBase, which was missing before.

DM: we can open a followup issue to improve the docs for the simplistic tokenizers as discussed here.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed revisions 1043114 and 1043181 (3x).

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1