Closed asfimport closed 18 years ago
Daniel Naber (migrated from JIRA)
This is how the code looks currently:
<CJ: // Chinese, Japanese
  [ "\u3040"-"\u318f", "\u3300"-"\u337f", "\u3400"-"\u3d2d", "\u4e00"-"\u9fff", "\uf900"-"\ufaff" ]
>
<KOREAN: // Korean
  [ "\uac00"-"\ud7af" ]
>
Are your suggested changes still needed, and if so, which range should be added to which section (Chinese/Japanese or Korean)?
John Wang (migrated from JIRA)
Yes I am.
Our i18n team has provided a more up-to-date list and I thought I'd contribute it back.
-John
Daniel Naber (migrated from JIRA)
John, I'm not sure I understand: do you think this issue can be closed now? If not, could you ask your i18n experts how your changes could be integrated into the current code, in which <KOREAN> and <CJ> are separate tokens?
Steven Rowe (@sarowe) (migrated from JIRA)
There are six classes of issues:
1. A character range in StandardTokenizer.jj that is missing in John's list, and should be left as-is in StandardTokenizer.jj (in the <CJ> section):
1.a. [ U+3100 - U+312F ] BoPoMoFo (a.k.a. ZhuYin): Phonetic transcription symbols used in Taiwan; not used on mainland China.
2. A character range in StandardTokenizer.jj that is also in John's list, but in the <LETTER> section rather than in the <CJ> section, and should be left as-is:
2.a. [ U+1100 - U+11FF ] Korean Jamo (phonetic symbols)
3. A character range in StandardTokenizer.jj that is not present in John's list, and that should be removed from the <KOREAN> section in StandardTokenizer.jj:
3.a. [ U+D7A4 - U+D7AF ] Non-character range at the end of the pre-composed Hangul (Korean) block
4. A character range in John's list that is missing in StandardTokenizer.jj, but which was not present in Unicode 3.0, and so strictly should not be included when running on Java 1.4; since this is a non-character range in Unicode 3.0, however, I think it should be included in StandardTokenizer.jj (in the <CJ> section) for future compatibility with Java 1.5 and Unicode 4.0:
4.a. [ U+31F0 - U+31FF ] Japanese Katakana phonetic extensions; these were introduced in Unicode version 3.2 (see http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )
5. Character ranges in John's list that are missing in StandardTokenizer.jj, and that should be added to the newly re-labeled <CJ> section:
5.a. [ U+FF65 - U+FF9F ] Half-width Japanese Katakana (phonetic symbols)
5.b. [ U+3D2E - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded) CJK Ideograph Extension A. This range was introduced in Unicode 3.0.
6. A character range in John's list that is missing in StandardTokenizer.jj, and that should be added to the <LETTER> section, since it, like the [ U+1100 - U+11FF ] range already included there, is a range of Korean Jamo (phonetic symbols):
6.a. [ U+FFA0 - U+FFDC ] Half-width Korean Jamo (phonetic symbols)
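To make the recommendations above easier to check, here is a minimal Java sketch that encodes the ranges as they would look after the suggested changes and classifies a code point against them. The class `CjkRangeSketch` and its methods are hypothetical and are not part of Lucene or of the generated tokenizer. Merging the existing U+3400 - U+3D2D range with 5.b into a single U+3400 - U+4DB5 range is an assumption about how the edit would be written, and only the two Jamo ranges of the much larger <LETTER> section are modeled.

```java
/** Hypothetical helper, not part of Lucene: mirrors the ranges recommended above. */
final class CjkRangeSketch {
    // Each row is an inclusive [start, end] pair of BMP code points.
    static final int[][] CJ = {
        {0x3040, 0x318F}, // already in the grammar; contains BoPoMoFo U+3100-U+312F (1.a)
        {0x31F0, 0x31FF}, // Katakana phonetic extensions (4.a)
        {0x3300, 0x337F}, // already in the grammar
        {0x3400, 0x4DB5}, // assumed merge of existing U+3400-U+3D2D with 5.b
        {0x4E00, 0x9FFF}, // already in the grammar
        {0xF900, 0xFAFF}, // already in the grammar
        {0xFF65, 0xFF9F}, // half-width Katakana (5.a)
    };
    static final int[][] KOREAN = {
        {0xAC00, 0xD7A3}, // U+D7A4-U+D7AF dropped (3.a)
    };
    // Only the Jamo ranges discussed above; the real <LETTER> section has many more.
    static final int[][] LETTER_JAMO = {
        {0x1100, 0x11FF}, // Korean Jamo (2.a)
        {0xFFA0, 0xFFDC}, // half-width Korean Jamo (6.a)
    };

    /** True if cp falls inside any inclusive range in the table. */
    static boolean inAny(int[][] ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    /** Returns "CJ", "KOREAN", "LETTER", or "OTHER" for the given code point. */
    static String classify(int cp) {
        if (inAny(CJ, cp)) return "CJ";
        if (inAny(KOREAN, cp)) return "KOREAN";
        if (inAny(LETTER_JAMO, cp)) return "LETTER";
        return "OTHER";
    }

    public static void main(String[] args) {
        int[] samples = {0x3105, 0x31F0, 0x4DB6, 0xD7A3, 0xD7A4, 0xFFA0};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %s", cp, classify(cp)));
        }
    }
}
```

A table like this is only a convenience for reviewing boundary cases such as U+4DB5 versus U+4DB6; the grammar file remains the source of truth.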
Steven Rowe (@sarowe) (migrated from JIRA)
Patch addressing the above-described issues
Steven Rowe (@sarowe) (migrated from JIRA)
Removed stray comma - obsoletes previous patch
Otis Gospodnetic (@otisg) (migrated from JIRA)
Thanks, I committed Steven Rowe's patch, although it doesn't seem to fully match what he said in comments above (e.g. in his patch, I don't see the range he mentioned in 5.b).
It seems the character list in the CJK section of StandardTokenizer.jj is not quite complete. The following is a more complete list:
<CJK: // non-alphabets
  [ "\u1100"-"\u11ff", "\u3040"-"\u30ff", "\u3130"-"\u318f", "\u31f0"-"\u31ff", "\u3300"-"\u337f", "\u3400"-"\u4dbf", "\u4e00"-"\u9fff", "\uac00"-"\ud7a3", "\uf900"-"\ufaff", "\uff65"-"\uffdc" ]
>
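One way to see how this list relates to the <CJ> and <KOREAN> ranges quoted elsewhere in this thread is to compute the difference between the two sets of ranges. The following Java sketch does that by brute force over the code points. The class `RangeGapSketch` is hypothetical, and it deliberately ignores the <LETTER> section, so U+1100 - U+11FF shows up as "missing" even though the thread notes it is already covered there.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: compares two sets of inclusive ranges. */
final class RangeGapSketch {
    // <CJ> and <KOREAN> ranges as quoted in the thread; <LETTER> is not modeled.
    static final int[][] CURRENT = {
        {0x3040, 0x318F}, {0x3300, 0x337F}, {0x3400, 0x3D2D},
        {0x4E00, 0x9FFF}, {0xAC00, 0xD7AF}, {0xF900, 0xFAFF},
    };
    // Ranges from the proposed <CJK> list.
    static final int[][] PROPOSED = {
        {0x1100, 0x11FF}, {0x3040, 0x30FF}, {0x3130, 0x318F}, {0x31F0, 0x31FF},
        {0x3300, 0x337F}, {0x3400, 0x4DBF}, {0x4E00, 0x9FFF}, {0xAC00, 0xD7A3},
        {0xF900, 0xFAFF}, {0xFF65, 0xFFDC},
    };

    /** True if cp falls inside any inclusive range in the table. */
    static boolean covered(int[][] ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    /** For each range in a, collects the runs of code points not covered by b. */
    static List<int[]> difference(int[][] a, int[][] b) {
        List<int[]> out = new ArrayList<>();
        for (int[] r : a) {
            int runStart = -1; // start of the current uncovered run, or -1 if none
            for (int cp = r[0]; cp <= r[1]; cp++) {
                boolean missing = !covered(b, cp);
                if (missing && runStart < 0) {
                    runStart = cp;
                } else if (!missing && runStart >= 0) {
                    out.add(new int[] {runStart, cp - 1});
                    runStart = -1;
                }
            }
            if (runStart >= 0) {
                out.add(new int[] {runStart, r[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("In proposed list but not in current <CJ>/<KOREAN>:");
        for (int[] r : difference(PROPOSED, CURRENT)) {
            System.out.println(String.format("  U+%04X - U+%04X", r[0], r[1]));
        }
        System.out.println("In current <CJ>/<KOREAN> but not in proposed list:");
        for (int[] r : difference(CURRENT, PROPOSED)) {
            System.out.println(String.format("  U+%04X - U+%04X", r[0], r[1]));
        }
    }
}
```

The brute-force scan is chosen for readability; the ranges involved are small enough that an interval-arithmetic version would add complexity without a practical benefit.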
Migrated from LUCENE-478 by John Wang, 2 votes, resolved Aug 13 2006 Attachments: StandardTokenizer.jj.diff (versions: 2)