apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

StandardTokenizer doesn't tokenize word:word [LUCENE-6103] #7165

Closed · asfimport closed this issue 9 years ago

asfimport commented 9 years ago

StandardTokenizer (and, as a result, most default analyzers) will not tokenize word:word; it preserves it as a single token. This is easy to see using Elasticsearch's analyze API:

localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
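
The same single-token output can be reproduced directly with StandardTokenizer; a minimal sketch (assuming a Lucene version with the no-arg tokenizer constructor, i.e. 5.0+):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordColonWordDemo {
  public static void main(String[] args) throws Exception {
    // Lucene 5.0+ style: no-arg constructor plus setReader();
    // on 4.x the constructor takes (Version, Reader) instead.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("word word:word"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // prints "word", then "word:word"
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```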

If this is the intended behavior, then why? I can't really see the logic behind it.

If not, I'll be happy to join in the effort of fixing this.


Migrated from LUCENE-6103 by Itamar Syn-Hershko, resolved Dec 09 2014

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

StandardTokenizer implements the word boundary rules in Unicode UAX#29.

The ASCII colon (and other colon-like forms) is included in the set of characters matched by the WordBreak:MidLetter property value, which appears in rules WB6 and WB7; these rules forbid word breaks between a colon and the surrounding letters.

To get what you want, you could customize the JFlex grammar used to generate StandardTokenizer by removing the colon from the MidLetter definition it uses.

Another alternative is ICUTokenizer, which allows runtime per-orthographic-script specification of word-break rules - check out the factory javadocs: http://lucene.apache.org/core/4_9_0/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.html
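
As a rough sketch of that second option (the rule file name here is hypothetical, and exact package locations and create() signatures differ between Lucene 4.x and 5.x+):

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;

public class TailoredIcuTokenizerDemo {
  public static void main(String[] args) throws Exception {
    // "Latin.no-colon.rbbi" is a hypothetical RBBI word-break rule file (with ':'
    // dropped from MidLetter) that must be available on the classpath.
    Map<String, String> factoryArgs = new HashMap<>();
    factoryArgs.put("rulefiles", "Latn:Latin.no-colon.rbbi");

    ICUTokenizerFactory factory = new ICUTokenizerFactory(factoryArgs);
    // ClasspathResourceLoader lives under o.a.l.analysis.util up to 8.x,
    // o.a.l.util from 9.x on.
    factory.inform(new ClasspathResourceLoader(TailoredIcuTokenizerDemo.class));

    // create() is the 5.x+ signature; 4.x takes a Reader argument instead.
    Tokenizer tokenizer = factory.create();
    tokenizer.setReader(new StringReader("word:word"));
    // ... then consume the stream as usual (reset / incrementToken / end / close).
  }
}
```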

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

Yes, I figured it would come down to some Unicode rules. Can you give a rationale for this, mainly out of curiosity?

Not a Unicode expert, but I'd assume that, just as you wouldn't want English words to not break on the Hebrew Punctuation Gershayim (e.g. Test"Word is actually 2 tokens while מנכ"לים is one), maybe this rule is meant for specific scenarios and not for the general use case?

On another note, any type of Gershayim should be preserved within Hebrew words, not only U+05F4. This is mainly because the keyboards and editors in use produce the standard " character in most cases. I had a chat with Robert a while back where he said that's the case; I'm just making sure you didn't follow the specs to the letter in that regard...

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> Yes, I figured it would come down to some Unicode rules. Can you give a rationale for this, mainly out of curiosity?

The comment in the MidLetter list says it's for Swedish. If you look at the revision history at the bottom of the page, the colon was temporarily removed from MidLetter between Unicode versions 6.2 and 6.3, but then put back before 6.3 was released (I guess this should be read from the bottom upward):

  • Restored colon and equivalents (removed in previous draft).
  • Removed colon from MidLetter, so that it is no longer contained within words. Handling of colon for word boundary determination in Swedish would be done by tailoring, instead – for example by a Swedish localization definition in CLDR.

I guess the Swedish contingent among Unicoders is strong?

> Not a Unicode expert, but I'd assume that, just as you wouldn't want English words to not break on the Hebrew Punctuation Gershayim (e.g. Test"Word is actually 2 tokens while מנכ"לים is one), maybe this rule is meant for specific scenarios and not for the general use case?

StandardTokenizer is not intended to be English-centric - instead it should do something reasonable with any text.

> On another note, any type of Gershayim should be preserved within Hebrew words, not only U+05F4. This is mainly because the keyboards and editors in use produce the standard " character in most cases. I had a chat with Robert a while back where he said that's the case; I'm just making sure you didn't follow the specs to the letter in that regard...

I did follow the specs to the letter, and it does the right thing:

Rules WB7b and WB7c forbid breaks around the ASCII double quote character, but only when surrounded by Hebrew letters.
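
A minimal sketch that shows both cases (same assumptions as the earlier StandardTokenizer sketch, i.e. a Lucene version with the no-arg constructor):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class GershayimDemo {
  // Prints each token of the given text on its own line.
  static void dumpTokens(String text) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }

  public static void main(String[] args) throws Exception {
    dumpTokens("Test\"Word"); // two tokens: Test, Word (the quote is not between Hebrew letters)
    dumpTokens("מנכ\"לים");    // one token: WB7b/WB7c keep the quote inside the Hebrew word
  }
}
```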

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

Good stuff, thanks Steve. I'm going to see if the rest of the UAX is good for us, and if so see if I can comply with the 6.2.5 version of the specs.

It's a good thing StandardTokenizer is no longer English-centric, but I cannot imagine what use the colon rule has, especially since in most cases the result is not "something reasonable" :)

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

Ok, so I did some homework. In Swedish, the colon is used as a way to shorten the writing of words: "C:a" is in fact "cirka", which means "approximately". I guess it can be thought of as similar to English acronyms, only apparently it's way less commonly used in Swedish (my source says "very very seldomly used; old style and not used in modern Swedish at all").

Not only is it hardly used, apparently it's only legal in 3-letter combinations (c:a but not c:ka).

Also, the effects it has are quite severe at the moment: 2 words with a colon between them and no space will be output as one token, even though it's 100% certain this is not the Swedish usage, since each word has > 2 characters.

I'm not aiming at changing the Unicode standards, that's way beyond my limited powers, but:

  1. Given the above, does it really make sense to use this tokenizer in all language-specific analyzers as well? e.g. https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105

I'd think that for language-specific analyzers we'd want tokenizers aimed at that language, with limited support for others. So, in this case, the colon would always be considered a tokenizing char.

  2. Can we change the JFlex definition to at least limit the effects of this, e.g. only support colon as MidLetter if the overall token length == 3, so c:a is a valid token and word:word is not?

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Cool info about Swedish.

  1. The beauty of implementing a standard is that once you've done that, making tweaks to suit particular constituencies isn't necessary. StandardTokenizer implements UAX#29 word break rules. Done.

  2. If you'd like to create tailored tokenizers for each individual language, please go ahead.

  3. See #0.

One other technique you may find useful: put a char filter in front of your tokenizer to change the problematic chars, e.g. PatternReplaceCharFilter, with a pattern something like (\p{L}):(\p{L}) and the replacement $1 $2.
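
Wired up programmatically, that looks roughly like this (a sketch only; class and variable names are illustrative):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ColonCharFilterDemo {
  public static void main(String[] args) throws Exception {
    // Rewrite "letter:letter" to "letter letter" before the tokenizer sees it,
    // so StandardTokenizer has nothing left to hold together.
    Pattern colonBetweenLetters = Pattern.compile("(\\p{L}):(\\p{L})");
    Reader filtered = new PatternReplaceCharFilter(
        colonBetweenLetters, "$1 $2", new StringReader("word:word"));

    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(filtered);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // prints "word", then "word"
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```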

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

  1. You mean it implements UAX#29 version 6.3 :)

  2. I'll likely be sending a PR for #1 sometime soon. Would you recommend using UAX#29 minus specific non-English tweaks, falling back to ClassicTokenizer (which is English-specific), or something else?

  3. Here's the thing: the standard is wrong, or buggy. Ask any Swede and they will tell you, and any non-Swedish corpus wouldn't care. Basically this is a bug in every Lucene-based system today because of the word:word scenario; it's a bit of an edge case, but I bet I can find multiple occurrences in every big enough system. What can we do about that?

We already solved this using char filters, converting colons to commas. It feels a bit hacky though, and again, this is a flaw in Lucene's analysis even though it conforms to a Unicode standard.
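
For reference, a sketch of that kind of workaround using MappingCharFilter (not necessarily the exact char filter setup we used):

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ColonToCommaDemo {
  public static void main(String[] args) throws Exception {
    // Map every ':' to ',' before tokenization; StandardTokenizer then splits
    // "word,word", since the comma is only kept between digits (MidNum).
    NormalizeCharMap.Builder mappings = new NormalizeCharMap.Builder();
    mappings.add(":", ",");
    Reader filtered = new MappingCharFilter(mappings.build(), new StringReader("word:word"));

    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(filtered);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // prints "word", then "word"
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```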

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

  1. In Lucene 4.7 through 4.10, yes, it implements the revision of UAX#29 associated with Unicode 6.3. I thought there was a JIRA to upgrade Lucene to Unicode 7.0, but I can't find it ATM. JFlex 1.6 and ICU 54.1 support Unicode 7.0.

  2. I recommend a language-specific tailoring of UAX#29. There are tailoring notes in the standard you'll want to look at.

  3. Unfortunately, I think the correct approach here is lobbying to change the standard.

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

Maybe out of scope for this ticket, but how do we go about #2? I'll be happy to take this discussion offline as well.

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> Maybe out of scope for this ticket, but how do we go about #2? I'll be happy to take this discussion offline as well.

Yeah, I'm not sure where the discussion should go, here's fine for me.

Prior to releasing new Unicode versions, PRIs (Public Review Issues) are created for proposed changes to individual standards: http://www.unicode.org/review/ - people can then submit comments, which are then considered for incorporation into the final standard. I don't see one there for UAX#29, but there have been for previous releases.

I think @rmuir is an individual member of the Unicode consortium - maybe he'll have some ideas on how to proceed?

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Looks like the intent is to funnel all public input into future Unicode versions through this contact form: http://www.unicode.org/reporting.html - you could start there.

asfimport commented 9 years ago

Itamar Syn-Hershko (migrated from JIRA)

Sent them a request. I'll buy Robert beers if that could help push this forward!

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I really like beers, but I think I can only give some suggestions:

Maybe it would be good to figure out the exact 'diff' you recommend to the data files / specifications, and also any actual data to support why the word breaks would be better. Try to think of the general task of word breaks and why the change would be better, and keep search out of it, etc.