apache/lucene: Apache Lucene open-source search software (https://lucene.apache.org/, Apache License 2.0)

Implement StandardTokenizer with the UAX#29 Standard [LUCENE-2167] #3243

Closed: asfimport closed this issue 13 years ago

asfimport commented 14 years ago

It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with JFlex. Then its name would actually make sense.

Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

This should be a good tokenizer for most European-language documents

The new StandardTokenizer could then say

This should be a good tokenizer for most languages.

All the English/Euro-centric stuff like the acronym/company/apostrophe handling can stay with that EuropeanTokenizer, and it could be used by the European analyzers.


Migrated from LUCENE-2167 by Shyamal Prasad, resolved Nov 15 2010.
Attachments: LUCENE-2167.benchmark.patch (versions: 3), LUCENE-2167.patch (versions: 19), LUCENE-2167-jflex-tld-macro-gen.patch (versions: 3), LUCENE-2167-lucene-buildhelper-maven-plugin.patch, standard.zip, StandardTokenizerImpl.jflex

asfimport commented 14 years ago

Shyamal Prasad (migrated from JIRA)

Patch fixes Javadoc with suggested text, adds test cases to motivate change.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Shyamal, I am not sure we should document this behavior; instead we should improve StandardAnalyzer.

Like you said, it is hard to make everyone happy, but we now have a mechanism to improve things, based on the Version constant you provide. For example, in a future release we hope to be able to use JFlex 1.5, which has greatly improved Unicode support.

You can try your examples against the Unicode segmentation standard here to get a preview of what this might look like: http://unicode.org/cldr/utility/breaks.jsp

asfimport commented 14 years ago

Shyamal Prasad (migrated from JIRA)

Hi Robert, I presume that when you say we should "instead improve standard analyzer" you mean the code should work more like the original Javadoc states it should? Or are you suggesting that moving to JFlex 1.5 is the way to address this?

The problem I observed was that the current JFlex rules don't implement what the Javadoc says is the behavior of the tokenizer. I'd be happy to spend some time on this if I could get some direction on where I should focus.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Robert, I presume that when you say we should "instead improve standard analyzer" you mean the code should work more like the original Javadoc states it should?

Shyamal, I guess what I am saying is I would prefer the javadoc of StandardTokenizer to be a little vague as to exactly what it does. I would actually prefer it to have fewer details than it currently has: in my opinion it starts getting into nitty-gritty details of what could be considered Version-specific.

I'd be happy to spend some time on this if I could get some direction on where I should focus.

If you have fixes to the grammar, I would prefer this over 'documenting buggy behavior'. #3150 gives us the capability to fix bugs without breaking backwards compatibility.

asfimport commented 14 years ago

Shyamal Prasad (migrated from JIRA)

Hi Robert,

It's been a while but I finally got around to working on the grammar. Clearly, much of this is an opinion, so I finally stuck to the one minor change that I believe is arguably an improvement. Previously comma separated fields containing digits would be mistaken for numbers and combined into a single token. I believe this is a mistake because part numbers etc. are rarely comma separated, and regular text that is comma separated is not uncommon. This is also the problem I ran into in real life when using Lucene :)

This patch stops treating comma separated tokens as numbers when they contain digits.

I did not include the patched Java file since I don't know what JFlex version I should use to create it (I used JFlex 1.4.3, and test-tag passes with JDK 1.5/1.6; I presume the Java 1.4 compatibility comment in the generated file is now history?).

Let me know if this is headed in a useful direction.

Cheers! Shyamal

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Clearly, much of this is an opinion, so I finally stuck to the one minor change that I believe is arguably an improvement. Previously comma separated fields containing digits would be mistaken for numbers and combined into a single token. I believe this is a mistake because part numbers etc. are rarely comma separated, and regular text that is comma separated is not uncommon.

I don't think it really has to be; I am actually of the opinion that StandardTokenizer should follow Unicode standard tokenization. Then we can throw subjective decisions away and stick with a standard.

In this example, I think the change would be bad, as the comma is treated differently depending upon context: it is a decimal separator and thousands separator in many languages, including English. So the treatment of the comma depends upon the previous character.

This is why, in Unicode, the comma has the MidNum Word_Break property.
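
To make that context dependence concrete, here is a hedged test sketch in the same assertAnalyzesTo style used later in this thread (the analyzer `a` and the expected tokens are illustrative of the UAX#29 default rules, not of any committed grammar):

    public void testCommaContext() throws Exception {
      // Under the UAX#29 default rules, a comma between digits is MidNum,
      // so the run of digits is not split (WB11/WB12)...
      assertAnalyzesTo(a, "1,500", new String[] { "1,500" });
      // ...but a comma between letters is neither MidLetter nor MidNumLet,
      // so it breaks the run and is dropped.
      assertAnalyzesTo(a, "red,green,blue", new String[] { "red", "green", "blue" });
    }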

asfimport commented 14 years ago

Shyamal Prasad (migrated from JIRA)

I don't think it really has to be; I am actually of the opinion that StandardTokenizer should follow Unicode standard tokenization. Then we can throw subjective decisions away and stick with a standard.

Yep, I see I am going for the wrong ambition level and only tweaking the existing grammar. I'll take a crack at understanding unicode standard tokenization, as you'd suggested originally, and try and produce something as soon as I get a chance. I see your point.

Cheers! Shyamal

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'll take a crack at understanding unicode standard tokenization, as you'd suggested originally, and try and produce something as soon as I get a chance.

I would love it if you could produce a grammar that implemented UAX#29!

If so, in my opinion it should become the StandardAnalyzer for the next lucene version. If I thought I could do it correctly, I would have already done it, as the support for the unicode properties needed to do this is now in the trunk of Jflex!

Here are some references that might help. The standard itself: http://unicode.org/reports/tr29/

In particular, the "Testing" portion: http://unicode.org/reports/tr41/tr41-5.html#Tests29

Unicode provides a WordBreakTest.txt file, that we could use from Junit, to help verify correctness: http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt

I'll warn you, I think it might be hard, but perhaps it's not that bad. In particular, the standard is defined in terms of "chained" rules, and JFlex doesn't support rule chaining, but I am not convinced we need rule chaining to implement WordBreak (maybe for LineBreak, but maybe WordBreak can be done easily without it?).
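
As a toy illustration of why WordBreak may not need chaining, the ALetter/MidLetter part of the word rules (WB5-WB7) can be collapsed into a single regular expression. This sketch uses plain java.util.regex with simplified stand-in character classes, not the real Word_Break property values or the JFlex grammar:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class WordBreakRegexSketch {
      // Simplified stand-ins for the UAX#29 ALetter and MidLetter/MidNumLet
      // classes; a real grammar would use the Word_Break property values.
      private static final String A_LETTER = "\\p{L}";
      private static final String MID_LETTER = "['\\u2019\\u00B7]";
      // WB5-WB7 as one pattern: runs of letters, optionally joined by a
      // single mid-letter character with letters on both sides.
      private static final Pattern WORD =
          Pattern.compile(A_LETTER + "+(?:" + MID_LETTER + A_LETTER + "+)*");

      public static void main(String[] args) {
        Matcher m = WORD.matcher("can't stop O'Reilly");
        while (m.find()) {
          System.out.println(m.group()); // can't / stop / O'Reilly
        }
      }
    }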

Steven Rowe is the expert on this stuff, maybe he has some ideas.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way, here is a statement from the standard that seems to confirm my suspicions:

In section 6.3, there is an example of the grapheme cluster boundaries converted into a simple regex (the kind we could do easily in jflex now that it has the properties available).

They make this statement: Such a regular expression can also be turned into a fast, deterministic finite-state machine. Similar regular expressions are possible for Word boundaries. Line and Sentence boundaries are more complicated, and more difficult to represent with regular expressions.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I wrote word break rules grammar specifications for JFlex 1.5.0-SNAPSHOT and both Unicode versions 5.1 and 5.2 - you can see the files here:

http://jflex.svn.sourceforge.net/viewvc/jflex/trunk/testsuite/testcases/src/test/cases/unicode-word-break/

The files are UnicodeWordBreakRules_5_*.* - these are written to: parse the Unicode test files; run the generated scanner against each composed test string; output the break opportunities/prohibitions in the same format as the test files; and then finally compare the output against the test file itself, looking for a match. (These tests currently pass.)
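
For reference, here is a rough sketch (not the JFlex test-suite code linked above) of how a JUnit test could consume WordBreakTest.txt: each data line lists code points separated by "÷" (break allowed) or "×" (no break), so a line can be parsed into the composed test string plus the expected boundary offsets. The checkBoundaries call is a hypothetical hook for the scanner under test:

    import java.util.ArrayList;
    import java.util.List;

    public class WordBreakTestLineParser {
      // Parses one data line of WordBreakTest.txt into the test string and
      // the expected break offsets (in UTF-16 code units).
      public static void parseLine(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) line = line.substring(0, hash); // strip trailing comment
        line = line.trim();
        if (line.isEmpty()) return;

        StringBuilder text = new StringBuilder();
        List<Integer> expectedBreaks = new ArrayList<Integer>();
        for (String piece : line.split("\\s+")) {
          if ("÷".equals(piece)) {
            expectedBreaks.add(text.length()); // boundary allowed here
          } else if ("×".equals(piece)) {
            // no boundary here
          } else {
            text.appendCodePoint(Integer.parseInt(piece, 16)); // hex code point
          }
        }
        // checkBoundaries(text.toString(), expectedBreaks); // hypothetical assertion
        System.out.println(text + " -> " + expectedBreaks);
      }

      public static void main(String[] args) {
        // illustrative line, not copied from the real test file
        parseLine("÷ 0031 × 002C × 0031 ÷ 0020 ÷ 0061 ÷\t# example");
      }
    }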

The .flex files would need to be significantly changed to be used as a StandardTokenizer replacement, but you can get an idea from them how to implement the Unicode word break rules in (as yet unreleased version 1.5.0) JFlex syntax.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Steven, thanks for providing the link.

I guess this is the point where I also say: I think it would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with JFlex (I realize in 1.5 we won't have > 0xffff support). Then its name would actually make sense.

In my opinion, such a transition would involve something like renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

This should be a good tokenizer for most European-language documents

The new StandardTokenizer could then say

This should be a good tokenizer for most languages.

All the English/Euro-centric stuff like the acronym/company/apostrophe handling could stay with that "EuropeanTokenizer" or whatever it's called, and it could be used by the European analyzers.

But if we implement the Unicode rules, I think we should drop all this English/Euro-centric stuff for StandardTokenizer. Otherwise it should be called StandardishTokenizer.

We can obviously preserve backwards compatibility with Version, as Uwe has created a way to use a different grammar for a different Version.

I expect some -1s to this; awaiting comments :)

asfimport commented 14 years ago

Shyamal Prasad (migrated from JIRA)

Robert Muir wrote:

I would love it if you could produce a grammar that implemented UAX#29!

If so, in my opinion it should become the StandardAnalyzer for the next lucene version. If I thought I could do it correctly, I would have already done it, as the support for the unicode properties needed to do this is now in the trunk of Jflex!

I'm not smart enough to know if I should even try to do it at all (let alone correctly), but I am always willing to learn! Thanks for the references, I will certainly give it an honest try.

/Shyamal

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

(stole Robert's comment to change the issue description)

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch implementing a UAX#29 tokenizer, along with most of Robert's TestICUTokenizer tests (left out tests for Thai, Lao, and breaking at 4K chars, none of which are features of this tokenizer). I re-upcased the downcased expected terms, and un-normalized the trailing Greek lowercase sigma in one of the expected terms in testGreek().

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I want to test performance relative to StandardTokenizer and ICUTokenizer, and also consider switching from lookahead chaining to single regular expression per term type to improve performance.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I ran contrib/benchmark over 10k Reuters docs with tokenization-only analyzers; Sun JDK 1.6, Windows Vista/Cygwin; best of five:

Operation          recsPerRun  rec/s       elapsedSec
StandardTokenizer  1262799     655,318.62  1.93
ICUTokenizer       1268451     536,116.25  2.37
UAX29Tokenizer     1268451     524,586.88  2.42

I think UAX29Tokenizer is slower than StandardTokenizer because it does the lookahead/chaining thing. Still, decent performance.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Steve, this is great progress!

Looking at the code/perf, is there any way to avoid the CharBuffer.wrap calls in updateAttributes()?

It seems that since you are just appending, it might be better to use an "append"-style approach like:

// sketch: appendLength and zzStart stand for the length and start offset of
// the matched text in zzBuffer
int newLength = termAtt.length() + appendLength;
char[] bufferWithRoom = termAtt.resizeBuffer(newLength);
System.arraycopy(zzBuffer, zzStart, bufferWithRoom, termAtt.length(), appendLength);
termAtt.setLength(newLength);

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I added your change removing CharBuffer.wrap(), Robert, and it appears to have sped it up, though not as much as I would like:

Operation          recsPerRun  rec/s       elapsedSec
StandardTokenizer  1262799     647,589.23  1.95
ICUTokenizer       1268451     526,328.22  2.41
UAX29Tokenizer     1268451     558,788.99  2.27

I plan on attempting to rewrite the grammar to eliminate chaining/lookahead this weekend.

Edit: fixed the rec/s figures, which were from the worst of five instead of the best of five; the elapsedSec values, however, were correct.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Attached a patch that removes lookahead/chaining. All tests pass.

UAX29Tokenizer is now in the same ballpark performance-wise as StandardTokenizer:

Operation          recsPerRun  rec/s       elapsedSec
StandardTokenizer  1262799     658,737.06  1.92
ICUTokenizer       1268451     542,768.94  2.34
UAX29Tokenizer     1268451     668,661.56  1.90

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Steven: this is impressive progress!

What do you think the next steps should be?

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Should we look at any tailorings to this? The first thing that comes to mind is full-width forms, which have no WordBreak property.

Looks like Latin full-width letters are included (from http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty.txt):

FF21..FF3A    ; ALetter  # L&  [26] FULLWIDTH LATIN CAPITAL LETTER A..FULLWIDTH LATIN CAPITAL LETTER Z
FF41..FF5A    ; ALetter  # L&  [26] FULLWIDTH LATIN SMALL LETTER A..FULLWIDTH LATIN SMALL LETTER Z

But as you mention in a code comment in TestICUTokenizer, there are no full-width WordBreak:Numeric characters, so we could just add these to the {NumericEx} macro, I think.
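
For example, a hypothetical test in the thread's style (assuming full-width digits U+FF10..U+FF19 have been added to {NumericEx}; they are not WordBreak:Numeric by default):

    public void testFullwidthNumeric() throws Exception {
      // full-width Latin letters are already ALetter; the digits only cluster
      // into one token once the {NumericEx} tailoring is in place
      assertAnalyzesTo(a, "ｌｕｃｅｎｅ １２３４", new String[] { "ｌｕｃｅｎｅ", "１２３４" });
    }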

Was there anything else you were thinking of?

Is it simple, or would it be messy, to apply this to the existing grammar (English/EuroTokenizer)? Another way to say it: is it possible for English/EuroTokenizer (StandardTokenizer today) to instead be a tailoring of UAX#29, for companies, acronyms, etc., such that if it encounters, say, some Hindi or Thai text it will behave better?

Not sure about difficulty level, but it should be possible.

Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

One other thing, Robert: what do you think of adding URL tokenization?

I'm not sure whether it's more useful to have the domain and path components separately tokenized. But maybe if someone wants that, they could add a filter to decompose?

It would be impossible to do post-tokenization composition to get back the original URL, however, so I'm leaning toward adding URL tokenization.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But as you mention in a code comment in TestICUTokenizer, there are no full-width WordBreak:Numeric characters, so we could just add these to the {NumericEx} macro, I think.

Was there anything else you were thinking of?

No, that's it!

Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

What valid constituencies do you refer to? In general the acronym/company/possessive stuff here is very English/Euro-specific. Bugs in JIRA get opened if it doesn't do this stuff right on English, but it doesn't even work at all for a lot of languages. Personally I think it's great to rip this stuff out of what should be a "default" language-independent tokenizer based on standards (StandardTokenizer), and put it into the language-specific package where it belongs. Otherwise we have to worry about these sorts of things overriding and screwing up UAX#29 rules for words in real languages.

What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

It gets a little tricky: we should be careful about how we interpret what is "reasonable" for a language-independent default tokenizer. I think it's "enough" to output the best indexing unit that is possible and relatively unambiguous to identify. I think this is a shortcut we can make, because we are trying to tokenize things for information retrieval, not for other purposes. The approach for Lao, Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables as the indexing unit, since words are ambiguous. Thai is based on words, not syllables, in ICUTokenizer, which is inconsistent with this, but we get this for free, so it's just a laziness thing.

By the way: none of those syllable-grammars in ICUTokenizer used chained rules, so you are welcome to steal what you want!

I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

Well, either way I again strongly feel this logic should be tied into "Standard" tokenizer, so that it has better Unicode behavior. I think it makes sense for us to have a reasonable, language-independent, standards-based tokenizer that works well for most languages. I think it also makes sense to have English/Euro-centric stuff that's language-specific sitting in the analysis.en package, just like we do with other languages.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

One other thing, Robert: what do you think of adding URL tokenization?

I think I would lean towards not doing this, only because of how complex a URL can be these days. It also starts to get a little ambiguous and will likely interfere with other rules (generating a lot of false positives).

I guess I don't care much either way; if it's strict and standards-based, it probably won't cause any harm. But if you start allowing things like HTTP URLs without the http:// being present, it's gonna cause some problems.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.

What valid constituencies do you refer to?

Well, we can't call it English/EuropeanTokenizer (maybe EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English or only European seems to leave the other out. Americans, e.g., don't consider themselves European, maybe not even linguistically (however incorrect that might be).

In general the acronym/company/possessive stuff here is very English/Euro-specific.

Right, I agree. I'm just looking for a name that covers the languages of interest unambiguously. WesternTokenizer? (but "I live east of the Rockies - can I use WesternTokenizer?"...) Maybe EuropeanLanguagesTokenizer? The difficulty as I see it is the messy intersection between political, geographic, and linguistic boundaries.

Bugs in JIRA get opened if it doesn't do this stuff right on English, but it doesn't even work at all for a lot of languages. Personally I think it's great to rip this stuff out of what should be a "default" language-independent tokenizer based on standards (StandardTokenizer), and put it into the language-specific package where it belongs. Otherwise we have to worry about these sorts of things overriding and screwing up UAX#29 rules for words in real languages.

I assume you don't mean to say that English and European languages are not real languages :) .

What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)

It gets a little tricky: we should be careful about how we interpret what is "reasonable" for a language-independent default tokenizer. I think it's "enough" to output the best indexing unit that is possible and relatively unambiguous to identify. I think this is a shortcut we can make, because we are trying to tokenize things for information retrieval, not for other purposes. The approach for Lao, Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables as the indexing unit, since words are ambiguous. Thai is based on words, not syllables, in ICUTokenizer, which is inconsistent with this, but we get this for free, so it's just a laziness thing.

I think that StandardTokenizer should contain tailorings for CJK, Thai, Lao, Myanmar, and Khmer, then - it should be able to do reasonable things for all languages/scripts, to the greatest extent possible.

The English/European tokenizer can then extend StandardTokenizer (conceptually, not in the Java sense).

I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?

Well, either way I again strongly feel this logic should be tied into "Standard" tokenizer, so that it has better Unicode behavior. I think it makes sense for us to have a reasonable, language-independent, standards-based tokenizer that works well for most languages. I think it also makes sense to have English/Euro-centric stuff that's language-specific sitting in the analysis.en package, just like we do with other languages.

I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable so-called StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.

It might be confusing, though, for a (e.g.) Greek user to have to go look at the analysis.en package to get reasonable performance for her language.

Maybe an EnglishTokenizer, and separately a EuropeanAnalyzer? Is that what you've been driving at all along??? (Silly me.... Sigh.)

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese?

By the way: none of those syllable-grammars in ICUTokenizer used chained rules, so you are welcome to steal what you want!

Thanks, I will! Of course now that you've given permission, it won't be as much fun...

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

One other thing, Robert: what do you think of adding URL tokenization?

I think I would lean towards not doing this, only because of how complex a URL can be these days. It also starts to get a little ambiguous and will likely interfere with other rules (generating a lot of false positives).

I have written standards-based URL tokenization routines in the past. I agree it's very complex, but I know it's do-able.

Do you have some examples of false positives? I'd like to add tests for them.

I guess I don't care much either way; if it's strict and standards-based, it probably won't cause any harm. But if you start allowing things like HTTP URLs without the http:// being present, it's gonna cause some problems.

Yup, I would only accept strictly correct URLs.

Now that international TLDs are a reality, it would be cool to be able to identify them.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I assume you don't mean to say that English and European languages are not real languages :).

I think the heuristics I am talking about that are in StandardTokenizer today, which don't really even work*, shouldn't have a negative effect on other languages, that's all.

I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable so-called StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.

It might be confusing, though, for a (e.g.) Greek user to have to go look at the analysis.en package to get reasonable performance for her language.

FYI, GreekAnalyzer didn't even use this stuff until 3.1 (it omitted StandardFilter).

But I don't think it matters where we put the "western" tokenizer, as long as it's not StandardTokenizer. I don't really even care too much about the stuff it does, honestly; I don't consider it very important, nor very accurate, only the source of many JIRA bugs* and hassle and confusion (invalidAcronym, etc.). It just seems to be more trouble than it's worth.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yup, I would only accept strictly correct URLs.

Now that international TLDs are a reality, it would be cool to be able to identify them.

+1. This is, in my opinion, the way such things in StandardTokenizer should work. Perhaps too strict for some folks' tastes, but correct!

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

I find that it works well to parse URLs as multiple tokens, so long as the query parser tokenizes them as phrases rather than individual terms. That allows you to hit on URL substrings, so e.g. a document containing 'http://www.example.com/index.html' is a hit for 'example.com'.

Happily, no special treatment for URLs also makes for a simpler parser.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Good point, Marvin - indexing URLs makes no sense without query support for them. (Is this a stupid can of worms for me to have opened?) I have used Lucene tokenizers for things other than retrieval (e.g. term vectors as input to other processes), and I suspect I'm not alone. The ability to extract URLs would be very nice.

Ideally, URL analysis would produce both the full URL as a single token, and as overlapping tokens the hostname, path components, etc. However, I don't think it's a good idea for the tokenizer to output overlapping tokens - I suspect this would break more than a few things.

A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.
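
For what it's worth, a rough sketch of such a filter follows. Assumptions: the Lucene 3.x TokenFilter/attribute API, a hypothetical "<URL>" token type emitted by the tokenizer, and a deliberately naive scheme-strip-and-split-on-'/' decomposition; a real version would need genuine URL parsing, correct offsets, and the internationalized-domain handling raised below:

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public final class UrlComponentsFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private final LinkedList<String> pending = new LinkedList<String>();

      public UrlComponentsFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
          // emit a buffered component, overlapping the original URL token
          // (offsets still cover the whole URL in this sketch)
          termAtt.setEmpty().append(pending.removeFirst());
          posIncAtt.setPositionIncrement(0);
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        if ("<URL>".equals(typeAtt.type())) { // hypothetical type name
          // naive decomposition: strip the scheme, then split on '/'
          String rest = termAtt.toString()
              .replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
          for (String component : rest.split("/")) {
            if (component.length() > 0) {
              pending.add(component);
            }
          }
        }
        return true; // the full URL itself is kept as the first token
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending.clear();
      }
    }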

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.

Not sure; for this to really work for non-English, it should recognize and normalize Punycode representations of international domain names, etc.

So while it's a good idea, maybe it is a can of worms, and better to leave it alone for now?

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.

Not sure; for this to really work for non-English, it should recognize and normalize Punycode representations of international domain names, etc.

So while it's a good idea, maybe it is a can of worms, and better to leave it alone for now?

Do you mean URL-as-token should not be attempted now? Or just this URL-breaking filter?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Do you mean URL-as-token should not be attempted now? Or just this URL-breaking filter?

We can always add tailorings later, as Uwe has implemented Version-based support.

Personally I see no problems with this patch, and I think we should look at tying this in as-is as the new StandardTokenizer, still backwards compatible thanks to Version support (we can just invoke EnglishTokenizerImpl in that case).

I still want to rip StandardTokenizer out of Lucene core and into modules. I think that's not too far away, and it's probably better to do this afterwards(?), but we can do it before that time if you want; doesn't matter to me.

It will be great to have StandardTokenizer working for non-European languages out of box!

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I think UAX29Tokenizer should remain as-is, except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons, as CJ chars are now. I need to augment the tests and make sure that valid word/number chars are not being dropped. Also, I want to add full-width numeric chars to the {NumericEx} macro.

A separate replacement StandardTokenizer class should have standards-based email and url tokenization - the current StandardTokenizer gets part of the way there, but doesn't support some valid emails, and while it recognizes host/domain names, it doesn't recognize full URLs. I want to get this done before anything in this issue is committed.

Then (after this issue is committed), in separate issues, we can add EnglishTokenizer (for things like acronyms and maybe removing possessives, as the current StandardFilter does), and then, as needed, other language-specific tokenizers.

I still want to rip StandardTokenizer out of Lucene core and into modules. I think that's not too far away, and it's probably better to do this afterwards(?), but we can do it before that time if you want; doesn't matter to me.

I'll finish the UAX29Tokenizer fixes this weekend, but I think it'll take me a week or so to get the URL/email tokenization in place.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words.

Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words. Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

You don't need any special support.

I don't know how this hack found its way in, but from a Thai tokenization perspective the only thing it is doing is preventing StandardTokenizer from splitting Thai on non-spacing marks (like it does wrongly for other languages).

So UAX#29 itself is the fix...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons

Do you have any examples?

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

except that I think there are some valid letter chars (Lao/Myanmar, I think) that are being dropped rather than returned as singletons

Do you have any examples?

I imported your tests from TestICUTokenizer, but I left out Lao, Myanmar and Thai because I didn't plan on adding tailorings like those you put in for ICUTokenizer. However, I think Lao had zero tokens output, so if you just import the Lao test from TestICUTokenizer you should see the issue.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Currently in StandardTokenizer there is a hack to allow contiguous Thai chars to be sent in a block to the ThaiWordFilter, which then uses the JDK BreakIterator to generate words.

Robert, were you thinking of not supporting that in the StandardTokenizer replacement in the short term?

I don't know how this hack found its way in, but from a Thai tokenization perspective the only thing it is doing is preventing StandardTokenizer from splitting Thai on non-spacing marks (like it does wrongly for other languages).

So UAX#29 itself is the fix...

AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

However, I think Lao had zero tokens output, so if you just import the Lao test from TestICUTokenizer you should see the issue.

OK, I will take a look. The algorithm there has some handling for incorrectly ordered Unicode, for example combining characters before the base form when they should be after... so it might be no problem at all.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

What is a Thai character? :). According to the standard, it should be outputting phrases as there is nothing to delimit them... you can see this by pasting some text into http://unicode.org/cldr/utility/breaks.jsp

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

AFAICT, UAX#29 would output individual Thai chars, just like CJ. Is that appropriate?

What is a Thai character? :). According to the standard, it should be outputting phrases as there is nothing to delimit them... you can see this by pasting some text into http://unicode.org/cldr/utility/breaks.jsp

Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hmm, I ran some tests; I think I see your problem.

I tried this:

  public void testThai() throws Exception {
    assertAnalyzesTo(a, "ภาษาไทย", new String[] { "ภาษาไทย" });
  }

The reason you get something different from the Unicode site is because (recently?) these have [:WordBreak=Other:]. Instead, anything that needs a dictionary or whatever is identified by [:Line_Break=Complex_Context:]. You can see this mentioned in the standard:

In particular, the characters with the Line_Break property values of Contingent_Break (CB), 
Complex_Context (SA/South East Asian), and XX (Unknown) are assigned word boundary property 
values based on criteria outside of the scope of this annex. 

In ICU, I noticed the default rules do this:

    $dictionary = [:LineBreak = Complex_Context:];
    $dictionary $dictionary

(so they just stick together with this chained rule)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

But why does it fail for my test (listed above) with only a single Thai phrase (nothing is output)? Do you think it is because of Complex_Context, or is there an off-by-one bug somehow?

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Yeah, your Thai text "การที่ได้ต้องแสดงว่างานดี. แล้วเธอจะไปไหน? ๑๒๓๔" breaks at space and punctuation and nowhere else. This test should be put back into TestUAX29Tokenizer with the appropriate expected output.

But why does it fail for my test (listed above) with only a single Thai phrase (nothing is output)? Do you think it is because of Complex_Context, or is there an off-by-one bug somehow?

Definitely Complex_Context. I'll add that in, and this should address Thai, Myanmar, Khmer, Tai Le, etc.

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

New patch addressing the following issues:

I couldn't find any suitable Lao test text (mostly because I don't know Lao at all), so I left out the Lao test in TestICUTokenizer, because Robert mentioned on #lucene that its characters are not in logical order.

edit: Complex_Content --> Complex_Context
edit #2: Added bullet about full-width numerics issue

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I couldn't find any suitable Lao test text (mostly because I don't know Lao at all), so I left out the Lao test in TestICUTokenizer, because Robert mentioned on #lucene that its characters are not in logical order.

Only some of my ICU tests contain "screwed up Lao".

But you should be able to use "good text" and it should do the right thing. Here's a test:

  assertAnalyzesTo(a, "ສາທາລະນະລັດ ປະຊາທິປະໄຕ ປະຊາຊົນລາວ",
      new String[] { "ສາທາລະນະລັດ", "ປະຊາທິປະໄຕ", "ປະຊາຊົນລາວ" });

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

New patch:

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

This patch contains the benchmarking implementation I've been using. I'm pretty sure we don't want this stuff in Lucene, so I'm including it here only for reproducibility by others. I have hardcoded absolute paths to the ICU4J jar and the contrib/icu jar in the script I use to run the benchmark (lucene/contrib/benchmark/scripts/compare.uax29.analyzers.sh), so if anybody tries to run this stuff, they will have to first modify that script.

On #lucene, Robert suggested comparing the performance of the straight ICU4J RBBI against UAX29Tokenizer, so I took his ICUTokenizer and associated classes, stripped out the script-detection logic, and made something I named RBBITokenizer, which is included in this patch.

To run the benchmark, you have to first run "ant jar" in lucene/ to produce the lucene core jar, and then again in lucene/contrib/icu/. Then in contrib/benchmark/, run scripts/compare.uax29.analyzers.sh.

Here are the results on my machine (Sun JDK 1.6.0_13; Windows Vista/Cygwin; best of five):

Operation          recsPerRun  rec/s       elapsedSec
ICUTokenizer       1268451     548,638.00  2.31
RBBITokenizer      1268451     568,047.94  2.23
StandardTokenizer  1262799     644,614.06  1.96
UAX29Tokenizer     1268451     640,631.81  1.98