Document string case conversion as ASCII-only

gbdev / rgbds

Rednex Game Boy Development System - An assembly toolchain for the Nintendo Game Boy and Game Boy Color

https://rgbds.gbdev.io

MIT License


Document string case conversion as ASCII-only #639

Closed: ISSOtm closed this issue 1 year ago

ISSOtm commented 3 years ago

STRUPR and STRLWR do not handle non-ASCII text properly, as in the STRUPR call below:

issotm@sheik-kitty ~/rgbds% cat test/asm/string.asm
    PRINTT STRCAT("Left", "right\n")
    PRINTT STRUPR("Garçon, café, s'il vous plaît !\n")
    PRINTT STRLWR("\"Hello!\" 「今日は！」\n")
issotm@sheik-kitty ~/rgbds% ./rgbasm test/asm/string.asm
Leftright
GARçON, CAFé, S'IL VOUS PLAîT !
"hello!" 「今日は！」

Processing Unicode correctly beyond UTF-8 encoding/decoding is difficult, so it would probably be best to use an external library for this. 0.4.3 / 0.5.0 already changed dependencies (Yacc → Bison), so this is probably a good opportunity. Two questions, then:

Which library should we use, or should we roll our own? The Unicode consortium FAQ recommends ICU.
Should we directly link against it (handled by compiler, cross-platform, no extra complexity), or dynamically load it (dependency optional)?

aaaaaa123456789 commented 3 years ago

ICU's license is a bit special, so you might want to consider making it an optional component, which would allow you to not distribute it with RGBDS (even in binary form).

Rangi42 commented 2 years ago

The ICU library (libicudata.a, libicui18n.a, libicuio.a, libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring (libunistring.a) is around 2MB, which is unacceptable for static linking. Both take many minutes to compile even on a good computer and require a lot of dependencies, including Python for ICU. On the other hand libgrapheme (libgrapheme.a) only weighs in at around 40K and is compiled (including Unicode data parsing) in fractions of a second, requiring nothing but a C99 compiler and make(1).

While ICU and libunistring offer a lot of functions and the weight mostly comes from locale-data provided by the Unicode standard, which is applied implementation-specifically (!) for some things, the same standard always defines a sane 'default' behaviour as an alternative in such cases that is satisfying in 99% of the cases and which you can rely on.

-- https://libs.suckless.org/libgrapheme/

Rangi42 commented 1 year ago

If the only thing we need more Unicode handling for is case conversion, https://github.com/rust-lang/rust/blob/master/library/core/src/unicode/unicode_data.rs looks portable without needing an entire ICU library.

ISSOtm commented 1 year ago

Honestly I don't think we want to depend on a version of the Unicode Standard, and given RGBASM's existing ASCII reliance, I'm of the opinion that we should define the case conversion functions to only work on ASCII?

Rangi42 commented 1 year ago

Yeah, that would be sensible.

aaaaaa123456789 commented 1 year ago

I'm happy with that approach.
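
For illustration, the ASCII-only behavior agreed on above could be sketched in C roughly as follows. This is a minimal sketch, not RGBDS's actual code: the helper names (`ascii_upper`, `ascii_lower`, `str_upper_ascii`, `str_lower_ascii`) are hypothetical, and it assumes strings are NUL-terminated and UTF-8 encoded. Only the bytes `a`-`z` and `A`-`Z` are changed; every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it never falls in those ranges and passes through untouched.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of RGBDS. */

/* Map an ASCII lowercase letter to uppercase; return any other byte unchanged. */
static char ascii_upper(char c)
{
    return (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : c;
}

/* Map an ASCII uppercase letter to lowercase; return any other byte unchanged. */
static char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
}

/*
 * Upper-case a NUL-terminated string in place, ASCII letters only.
 * Bytes >= 0x80 (UTF-8 lead and continuation bytes) are outside the
 * 'a'..'z' range whether char is signed or unsigned, so they are kept.
 */
void str_upper_ascii(char *s)
{
    for (; *s != '\0'; s++)
        *s = ascii_upper(*s);
}

/* Lower-case a NUL-terminated string in place, ASCII letters only. */
void str_lower_ascii(char *s)
{
    for (; *s != '\0'; s++)
        *s = ascii_lower(*s);
}
```

Under these assumptions, upper-casing "Garçon, café" leaves "ç" and "é" as they are, which matches the STRUPR output shown in the original report. Avoiding the locale-dependent `toupper`/`tolower` keeps the result the same regardless of the host's locale settings.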