CJK char list [LUCENE-478]

asfimport commented 18 years ago

The character list in the CJK section of StandardTokenizer.jj seems to be incomplete. The following is a more complete list:

<CJK: // non-alphabets
  [
    "\u1100"-"\u11ff",
    "\u3040"-"\u30ff",
    "\u3130"-"\u318f",
    "\u31f0"-"\u31ff",
    "\u3300"-"\u337f",
    "\u3400"-"\u4dbf",
    "\u4e00"-"\u9fff",
    "\uac00"-"\ud7a3",
    "\uf900"-"\ufaff",
    "\uff65"-"\uffdc"
  ]
>

Migrated from LUCENE-478 by John Wang, 2 votes, resolved Aug 13 2006

Attachments: StandardTokenizer.jj.diff (versions: 2)

asfimport commented 18 years ago

Daniel Naber (migrated from JIRA)

This is how the code looks currently:

<CJ: // Chinese, Japanese
  [
    "\u3040"-"\u318f",
    "\u3300"-"\u337f",
    "\u3400"-"\u3d2d",
    "\u4e00"-"\u9fff",
    "\uf900"-"\ufaff"
  ]
>
<KOREAN: // Korean
  [
    "\uac00"-"\ud7af"
  ]
>

Are your suggested changes still needed, and if so, which range should be added to which section (Chinese/Japanese or Korean)?
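
One way to see what the proposed list would change relative to the two current tokens is to compare the ranges mechanically. The following is only a sketch: the class name `CjkRangeDiff` and its helpers are hypothetical, the ranges are copied from the two snippets quoted in this thread, and other tokens in StandardTokenizer.jj (which are not quoted here) are ignored.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene: compares the range lists quoted in this thread.
public class CjkRangeDiff {

    // Proposed <CJK> list, as inclusive {start, end} pairs.
    static final int[][] PROPOSED = {
        {0x1100, 0x11ff}, {0x3040, 0x30ff}, {0x3130, 0x318f}, {0x31f0, 0x31ff},
        {0x3300, 0x337f}, {0x3400, 0x4dbf}, {0x4e00, 0x9fff}, {0xac00, 0xd7a3},
        {0xf900, 0xfaff}, {0xff65, 0xffdc}
    };

    // Current <CJ> ranges followed by the single <KOREAN> range.
    static final int[][] CURRENT = {
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}, {0xac00, 0xd7af}
    };

    // Marks every code point covered by the given inclusive ranges.
    static BitSet toBitSet(int[][] ranges) {
        BitSet bits = new BitSet(0x10000);
        for (int[] r : ranges) {
            bits.set(r[0], r[1] + 1); // the upper bound of BitSet.set is exclusive
        }
        return bits;
    }

    // Returns the code points covered by a but not by b, merged into ranges.
    static List<String> difference(int[][] a, int[][] b) {
        BitSet diff = toBitSet(a);
        diff.andNot(toBitSet(b));
        List<String> out = new ArrayList<>();
        for (int start = diff.nextSetBit(0); start >= 0; ) {
            int end = diff.nextClearBit(start) - 1;
            out.add(String.format("U+%04X-U+%04X", start, end));
            start = diff.nextSetBit(end + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("Only in proposed list:  " + difference(PROPOSED, CURRENT));
        System.out.println("Only in current tokens: " + difference(CURRENT, PROPOSED));
    }
}
```

The output only shows where the two lists differ; it does not say whether a range belongs with Chinese/Japanese or with Korean, which still has to be decided per range.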

asfimport commented 18 years ago

John Wang (migrated from JIRA)

Yes I am.

Our i18n team has provided a more up-to-date list and I thought I'd contribute it back.

-John

asfimport commented 18 years ago

Daniel Naber (migrated from JIRA)

John, I'm not sure I understand: do you think that this issue can be closed now? If not, could you ask your i18n experts how your changes could be integrated into the current code (the one where K/Korean and CJ are separate things)?

asfimport commented 18 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

There are six classes of issues:

1. A character range in StandardTokenizer.jj that is missing in John's list, and should be left as-is in StandardTokenizer.jj (in the <CJ> section):

1.a. [ U+3100 - U+312F ] BoPoMoFo (a.k.a. ZhuYin): Phonetic transcription symbols used in Taiwan; not used on mainland China.

2. A character range in StandardTokenizer.jj that is also in John's list, but in the <LETTER> section rather than in the <CJ> section, and should be left as-is:

2.a. [ U+1100 - U+11FF ] Korean Jamo (phonetic symbols)

3. A character range in StandardTokenizer.jj that is not present in John's list, and that should be removed from the <KOREAN> section in StandardTokenizer.jj:

3.a. [ U+D7A4 - U+D7AF ] Non-character range at the end of the pre-composed Hangul (Korean) block

4. A character range in John's list that is missing in StandardTokenizer.jj, but which was not present in Unicode 3.0, and so strictly should not be included when running on Java 1.4; since this is a non-character range in Unicode 3.0, however, I think it should be included in StandardTokenizer.jj (in the <CJ> section) for future compatibility with Java 1.5 and Unicode 4.0:

4.a. [ U+31F0 - U+31FF ] Japanese Katakana phonetic extensions; these were introduced in Unicode version 3.2 (see http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )

5. Character ranges in John's list that are missing in StandardTokenizer.jj, and that should be added to the newly re-labeled <CJ> section:

5.a. [ U+FF65 - U+FF9F ] Half-width Japanese Katakana (phonetic symbols)

5.b. [ U+3D2E - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded) CJK Ideograph Extension A. This range was introduced in Unicode 3.0.

6. A character range in John's list that is missing in StandardTokenizer.jj, and that should be added to the <LETTER> section, since it, like the [ U+1100 - U+11FF ] range already included there, is a range of Korean Jamo (phonetic symbols):

6.a. [ U+FFA0 - U+FFDC ] Half-width Korean Jamo (phonetic symbols)
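
The block assignments above can be spot-checked against the Unicode tables bundled with a JVM. The sketch below is an illustration only: the class name `BlockLookup` is hypothetical, it assumes Java 5 or later (the `int` overloads of `Character.UnicodeBlock.of` and `Character.isDefined` do not exist on Java 1.4), and whether a code point is reported as defined depends on the Unicode version that the running JVM implements.

```java
// Hypothetical helper, not part of Lucene: reports how the running JVM classifies a code point.
public class BlockLookup {

    // Describes a code point using the Unicode data of the running JVM.
    static String describe(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return String.format("U+%04X block=%s defined=%b",
                codePoint, block, Character.isDefined(codePoint));
    }

    public static void main(String[] args) {
        // One sample code point from each range discussed in items 1.a to 6.a.
        int[] samples = {
            0x3100, 0x1100, 0xD7A4, 0x31F0, 0xFF65, 0x3D2E, 0x4DB6, 0xFFA0
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Because the result reflects the JVM's Unicode version rather than Unicode 3.0, a code point that was unassigned at the time of this discussion may be reported as defined on a newer runtime.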

asfimport commented 18 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch addressing the above-described issues

asfimport commented 18 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Removed stray comma - obsoletes previous patch

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Thanks, I committed Steven Rowe's patch, although it doesn't seem to fully match what he said in comments above (e.g. in his patch, I don't see the range he mentioned in 5.b).
