Unicode fixes

lemzwerg commented 8 years ago

Please revise; some decisions might be questionable.

Pomax commented 8 years ago

While the XeTeX list mulls over the request to widen the class number from uint8 to uint16, not loading some of the more exotic classes seems sensible.

Just thinking out loud, we could also load only plane 0 by default, with \plane1, \plane2, \plane14, \plane15, and \plane16 commands to load the additional planes as necessary?

lemzwerg commented 8 years ago

I would rather suggest that some blocks get merged. For example, there is zero advantage in having three blocks for 'Myanmar', 'MyanmarExtendedA', and 'MyanmarExtendedB'; a single block would both be better and save \XeTeXcharclass registers.

However, this needs some recoding, and I don't have time for that, unfortunately.
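
To make the register saving concrete, here is a minimal sketch of what a merged Myanmar class could look like. It is not the ucharclasses implementation: `\MyanmarClass` and `\setclassrange` are hypothetical names introduced only for illustration, and the sketch assumes a XeLaTeX format that provides `\newXeTeXintercharclass`. The ranges are the three Myanmar ranges listed in this thread.

```latex
% Sketch only: \MyanmarClass and \setclassrange are hypothetical names,
% not part of ucharclasses.
\documentclass{article}
\makeatletter
% One class register for the whole script instead of three.
\newXeTeXintercharclass\MyanmarClass
% Assign every code point in the inclusive range #2..#3 to class #1.
\def\setclassrange#1#2#3{%
  \@tempcnta=#2\relax
  \loop
    \XeTeXcharclass\@tempcnta=#1\relax
  \ifnum\@tempcnta<#3\relax
    \advance\@tempcnta\@ne
  \repeat}
\setclassrange\MyanmarClass{"1000}{"109F}% Myanmar
\setclassrange\MyanmarClass{"A9E0}{"A9FF}% Myanmar Extended-B
\setclassrange\MyanmarClass{"AA60}{"AA7F}% Myanmar Extended-A
\makeatother
\begin{document}
\end{document}
```

With this layout, any `\XeTeXinterchartoks` rules for Myanmar would need to be declared once against `\MyanmarClass` rather than once per block.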

Pomax commented 8 years ago

Hm, that feels like a thing I can get done with a quick script or even a bit of Sublime Text'ing, although figuring out which blocks can be collapsed that way will indeed take a little bit of time.

lemzwerg commented 8 years ago

Well, here are groups of the simpler cases, which should be sufficient for the time being...

\do{Arabic}{"0600}{"06FF}
\do{ArabicExtendedA}{"08A0}{"08FF}
\do{ArabicPresentationFormsA}{"0FB50}{"0FDFF}
\do{ArabicPresentationFormsB}{"0FE70}{"0FEFF}
\do{ArabicSupplement}{"0750}{"077F}

\do{Bamum}{"0A6A0}{"0A6FF}
\do{BamumSupplement}{"016800}{"016A3F}

\do{BasicLatin}{"0020}{"007F} % 0000..007F in Unicode standard
\do{LatinExtendedA}{"0100}{"017F}
\do{LatinExtendedAdditional}{"01E00}{"01EFF}
\do{LatinExtendedB}{"0180}{"024F}
\do{LatinExtendedC}{"02C60}{"02C7F}
\do{LatinExtendedD}{"0A720}{"0A7FF}
\do{LatinExtendedE}{"0AB30}{"0AB6F}
\do{LatinSupplement}{"0080}{"00FF}

\do{Bopomofo}{"03100}{"0312F}
\do{BopomofoExtended}{"031A0}{"031BF}

\do{Cherokee}{"013A0}{"013FF}
\do{CherokeeSupplement}{"0AB70}{"0ABBF}

\do{Coptic}{"02C80}{"02CFF}
\do{CopticEpactNumbers}{"0102E0}{"0102FF}

\do{Cyrillic}{"0400}{"04FF}
\do{CyrillicExtendedA}{"02DE0}{"02DFF}
\do{CyrillicExtendedB}{"0A640}{"0A69F}
\do{CyrillicSupplement}{"0500}{"052F}

\do{Devanagari}{"0900}{"097F}
\do{DevanagariExtended}{"0A8E0}{"0A8FF}

\do{Ethiopic}{"01200}{"0137F}
\do{EthiopicExtended}{"02D80}{"02DDF}
\do{EthiopicExtendedA}{"0AB00}{"0AB2F}
\do{EthiopicSupplement}{"01380}{"0139F}

\do{Georgian}{"010A0}{"010FF}
\do{GeorgianSupplement}{"02D00}{"02D2F}

\do{GreekAndCoptic}{"0370}{"03FF}
\do{GreekExtended}{"01F00}{"01FFF}

\do{HangulCompatibilityJamo}{"03130}{"0318F}
\do{HangulJamo}{"01100}{"011FF}
\do{HangulJamoExtendedA}{"0A960}{"0A97F}
\do{HangulJamoExtendedB}{"0D7B0}{"0D7FF}
\do{HangulSyllables}{"0AC00}{"0D7AF}

\do{Khmer}{"01780}{"017FF}
\do{KhmerSymbols}{"019E0}{"019FF}

\do{MeeteiMayek}{"0ABC0}{"0ABFF}
\do{MeeteiMayekExtensions}{"0AAE0}{"0AAFF}

\do{Myanmar}{"01000}{"0109F}
\do{MyanmarExtendedA}{"0AA60}{"0AA7F}
\do{MyanmarExtendedB}{"0A9E0}{"0A9FF}

\do{Sinhala}{"0D80}{"0DFF}
\do{SinhalaArchaicNumbers}{"0111E0}{"0111FF}

\do{Sundanese}{"01B80}{"01BBF}
\do{SundaneseSupplement}{"01CC0}{"01CCF}

\do{UnifiedCanadianAboriginalSyllabics}{"01400}{"0167F}
\do{UnifiedCanadianAboriginalSyllabicsExtended}{"018B0}{"018FF}

Pomax commented 8 years ago

thanks!

Pomax commented 8 years ago

Given the v2.1 release of ucharclasses, which makes it work with XeTeX 0.99996, would you be willing to rebase this PR?

lemzwerg commented 8 years ago

Rebased. There's now also code to cater for the LaTeX override, using code suggested by David Carlisle.

lemzwerg commented 8 years ago

It's not clear to me what you mean by 'duplication'. Please elaborate.

Pomax commented 8 years ago

On line 44 we have:

\ifdefined\XeTeXinterwordspaceshaping
  \def\newXeTeXintercharclass{%
    \e@alloc\XeTeXcharclass\chardef\xe@alloc@intercharclass\m@ne{4095 }}
\fi

but the 0.99994 fix also introduced this on line 806:

\ifdefined\XeTeXinterwordspaceshaping
  \chardef\@ucharclass@boundary=4095 %
\else
  \chardef\@ucharclass@boundary=\@cclv
\fi

Looking at it more closely, that's not duplication, but should those two things be grouped into a single \ifdefined...\fi block?
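
For illustration, the two conditionals quoted above could be combined roughly as follows. This is a sketch of the idea rather than the follow-up patch itself; it assumes that both definitions can live at the earlier location and that nothing between the two original positions depends on their order.

```latex
% Sketch only: a possible merged form of the two conditionals quoted above.
% Intended for a .sty file, where @ already has letter catcode.
\ifdefined\XeTeXinterwordspaceshaping
  % Newer XeTeX: class numbers go up to 4095.
  \chardef\@ucharclass@boundary=4095 %
  \def\newXeTeXintercharclass{%
    \e@alloc\XeTeXcharclass\chardef\xe@alloc@intercharclass\m@ne{4095 }}
\else
  % Older XeTeX: class numbers stop at 255.
  \chardef\@ucharclass@boundary=\@cclv
\fi
```

Keeping the literal `4095 ` inside `\newXeTeXintercharclass` leaves that macro exactly as quoted; only the placement of the two definitions changes.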

lemzwerg commented 8 years ago

Probably yes; it would be a minor follow-up patch.

Pomax commented 8 years ago

wfm, I've filed https://github.com/Pomax/ucharclasses/issues/17 for that purpose and will merge this in. Would you like to be credited in the .sty file and README for the v2.2 update this will lead to?

Pomax / ucharclasses

Unicode fixes #12