Big5 kana are mapped to PUA (glyph not found error)

timdream commented 6 years ago

Name of the game:

An old private game between friends for holiday greeting (Sorry for not being able to release it w/o permission)

Player platform:

Web

Describe the issue in detail and how to reproduce it:

I was super excited to revive the old game from my backups with EasyRPG! The game runs without a problem, except for the fact that two characters in the dialog were rendered as �. The console shows

Debug: glyph not found: 0xf70b
Debug: glyph not found: 0xf749

Judging by the context, I guess the characters should be either くん or さん. Given that U+F70B and U+F749 both lie in the Private Use Area (PUA), I am pretty sure the bug is caused by the fact that the game was developed on a patched Windows XP with kana mapped to the PUA.
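
For reference, a minimal sketch of how one could check that the reported code points fall in the PUA. It assumes only the Unicode definition of the BMP Private Use Area (U+E000 to U+F8FF); the helper names `is_bmp_pua` and `find_pua` are made up for illustration and are not part of EasyRPG.

```python
from typing import List

# Bounds of the BMP Private Use Area as defined by Unicode.
PUA_START = 0xE000
PUA_END = 0xF8FF


def is_bmp_pua(code_point: int) -> bool:
    """Return True if the code point lies in the BMP Private Use Area."""
    return PUA_START <= code_point <= PUA_END


def find_pua(text: str) -> List[str]:
    """List every PUA character in text as a U+XXXX string (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text if is_bmp_pua(ord(ch))]


# The two code points from the debug log above.
print(is_bmp_pua(0xF70B), is_bmp_pua(0xF749))
```

Running `find_pua` over the decoded dialog text would list every character that needs either a glyph or a remapping.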

(Context: back in the day, Windows with the codepage set to CP950 could not encode Japanese kana unless the user patched it with Big5-UAO or Sakura IME. Big5-UAO replaces the CP950 mapping table to map the kana to the PUA, and Sakura IME installs custom glyphs in the PUA; both enable non-Unicode applications, RPG Maker 2003 in this case, to render kana.)

The way to fix this would be to have EasyRPG decode Big5 bytes with a patched Big5-UAO mapping instead of CP950. I couldn't find the decoding table in the repo, however. Maybe EasyRPG relies on the platform to do the decoding?

Let me know where the decoding table is and I can help patch it. I am pretty sure we could fix this with the mapping table here.

I understand this is not a standard setup at all even at the time, so I am fine if we want to WONTFIX this. Thanks for your consideration!

Ghabry commented 6 years ago

That's an interesting problem with encodings again.

We use ICU for handling the codepage conversions (or iconv, but our official ports all use ICU). Because the missing Glyphs are reported in the PUA already it seems that ICU already supports this feature?

This thread https://bugs.chromium.org/p/chromium/issues/detail?id=277868 confirms it: "ICU's windows-950-2000.ucm that we use for Big5 has quite a lot of mappings to PUA code points"

As long as there is no other codepage abusing the PUA I don't see a problem in mapping glyphs into it.

To get anything rendered you have to modify our built-in font: https://github.com/EasyRPG/Player/tree/master/resources/shinonome

1. Copy-paste one of the font folders (containing a .bit file).
2. Open the .bit file and change CHARSET_REGISTRY to "UTF32-LE" (if not already the case).
3. A char goes from STARTCHAR to ENDCHAR.
4. Set the value after STARTCHAR to the value of the codepoint (see other .bit files with "UTF32-LE" for reference).
5. For full width set the BBX to 12 12 0 -2, for half width 6 12 0 -2.
6. Draw the glyph (one of the files already contains kana).
7. Add a reference to the new .bit file to the generate_cxx_font.rb Ruby script.
8. Run the script.
9. Recompile Player.

timdream commented 6 years ago

Thanks for the quick reply. Is it possible to set multiple codepoints to the same glyph? That feels like a better approach than duplicating the glyphs.

It makes sense for EasyRPG to use ICU/iconv from the platform that supports it.

Ghabry commented 6 years ago

At least I'm not aware of a way to assign one glyph to multiple codepoints.

Don't think we had that use case before? (Maybe @rohkea can help)

timdream commented 6 years ago

> We use ICU for handling the codepage conversions (or iconv, but our official ports all use ICU). Because the missing Glyphs are reported in the PUA already it seems that ICU already supports this feature?
>
> This thread https://bugs.chromium.org/p/chromium/issues/detail?id=277868 confirms it: "ICU's windows-950-2000.ucm that we use for Big5 has quite a lot of mappings to PUA code points"

@Ghabry It's interesting to rethink this from the decoding perspective too. I assume the Emscripten toolchain pulls its own impl/port of libicu, as opposed to using the TextDecoder Web API? If it used TextDecoder, it would actually decode the kana correctly, since the Big5 decoding table in the WHATWG Encoding Standard accounts for them.
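
To make the comparison concrete, here is a rough sketch of the pointer arithmetic the WHATWG Encoding Standard's Big5 decoder uses to turn a lead/trail byte pair into an index into its Big5 table. The table itself is not reproduced here, and the function name `big5_pointer` is just an illustrative choice.

```python
from typing import Optional


def big5_pointer(lead: int, trail: int) -> Optional[int]:
    """Compute the WHATWG Big5 index pointer for a two-byte sequence.

    Returns None when the bytes do not form a valid lead/trail pair.
    """
    if not 0x81 <= lead <= 0xFE:
        return None
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    # Trail bytes below 0x7F are offset by 0x40, the upper range by 0x62,
    # which packs both trail ranges into 157 consecutive slots per lead byte.
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


# Example: bytes 0xA4 0x40 give (0xA4 - 0x81) * 157 + 0 = 5495.
print(big5_pointer(0xA4, 0x40))
```

The code point is then whatever the standard's Big5 index lists at that pointer, so the difference from CP950 lies entirely in the table contents, not in this arithmetic.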

Ghabry commented 6 years ago

> I assume the Emscripten toolchain pulls its own impl/port of libicu

Yes, it uses a stripped-down ICU data file (because the normal data file is 20 MB), and for Big5 we use windows-950-2000.ucm. We don't use any Web APIs if it can be avoided.

rohkea commented 6 years ago

‘神秘世界之旅 - 迷失的心’ (Adventures in the Mysterious World: The Lost of the Hearts) is probably affected by this. In the talk when meeting Anderson, a PUA character U+E2E5 is used:

[Screenshot: screenshot_27]

Here’s a savegame for this moment: Save01.lsd (walk up to the door after loading and you’ll have this dialog).

rohkea commented 2 months ago

I'm now looking into this, and I'm not sure adding PUA characters to the font is a good solution: we now have TTF font support, and users might want to drop in their own font, which will probably not have the PUA characters.

Maybe we could just change the mapping? Or add a separate postprocessing step?
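
A minimal sketch of what such a postprocessing step could look like, written in Python for brevity rather than in Player's C++. The two table entries are assumptions: they pair the code points from the original report with さ and ん, which matches the reporter's guess of さん but has not been verified against the Big5-UAO table. A real table would be generated from that mapping, and `remap_pua` is a hypothetical name.

```python
# Assumed PUA -> kana pairs, for illustration only. A real table would be
# generated from the Big5-UAO mapping rather than written by hand.
PUA_TO_KANA = {
    0xF70B: 0x3055,  # assumed: HIRAGANA LETTER SA
    0xF749: 0x3093,  # assumed: HIRAGANA LETTER N
}


def remap_pua(text: str, table=PUA_TO_KANA) -> str:
    """Replace known PUA code points after decoding; leave other text untouched."""
    # str.translate accepts a dict mapping code points to code points.
    return text.translate(table)


decoded = "\uf70b\uf749"  # what the CP950 decoder would hand back
print(remap_pua(decoded))
```

Doing the replacement after decoding keeps the converter untouched and works with any font, built-in or user-supplied, as long as it contains the regular kana.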
