Raku / nqp

[JVM] Several opcodes for string (e.g. chars, substr*) work on Java's chars (UTF-16) instead of graphemes #783

Open usev6 opened 1 year ago

usev6 commented 1 year ago

For the JVM backend, various Unicode-related tests (e.g. in https://github.com/Raku/roast/) fail because some string opcodes work on Java's chars (UTF-16 code units) rather than on graphemes.

Examples:

$ ./rakudo-m -e 'my Str $u = "\x[0043,0323]"; say "$u -- chars: " ~ $u.chars'
C̣ -- chars: 1
$ ./rakudo-j -e 'my Str $u = "\x[0043,0323]"; say "$u -- chars: " ~ $u.chars'
C̣ -- chars: 2
$ ./rakudo-m -e 'my $str = join "", 0x10426.chr, 0x10427.chr; say $str.chars; say substr($str, 0, 1).uniname; say substr($str, 1, 1).uniname'
2
DESERET CAPITAL LETTER OI
DESERET CAPITAL LETTER EW
$ ./rakudo-j -e 'my $str = join "", 0x10426.chr, 0x10427.chr; say $str.chars; say substr($str, 0, 1).uniname; say substr($str, 1, 1).uniname'
4
<surrogate-D801>
<surrogate-DC26>

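To make the difference concrete on the Java side, here is a minimal sketch of the three ways a string can be measured: UTF-16 code units, code points, and grapheme-like clusters. The class and method names (`GraphemeCounts`, `substrGraphemes`, ...) are hypothetical and are not taken from NQP's JVM backend. The sketch relies on `java.text.BreakIterator`, whose character boundaries are only an approximation of what NFG would give, so treat it as an illustration and not as a proposed fix.

```java
import java.text.BreakIterator;

// Hypothetical helper class for illustration only; not part of NQP.
final class GraphemeCounts {
    // UTF-16 code units: this is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // User-perceived characters as segmented by BreakIterator
    // (an approximation of graphemes, not NFG itself).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Substring addressed in BreakIterator character units instead of
    // UTF-16 offsets. Returns "" if start lies past the end of the string.
    static String substrGraphemes(String s, int start, int length) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int from = it.first();
        for (int i = 0; i < start && from != BreakIterator.DONE; i++) {
            from = it.next();
        }
        if (from == BreakIterator.DONE) {
            return "";
        }
        int to = from;
        for (int i = 0; i < length; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break;
            }
            to = next;
        }
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // Same strings as in the rakudo examples above.
        String combining = "C\u0323";
        String deseret = new String(Character.toChars(0x10426))
                       + new String(Character.toChars(0x10427));

        for (String s : new String[] { combining, deseret }) {
            System.out.println("units=" + utf16Units(s)
                + " codepoints=" + codePoints(s)
                + " graphemes=" + graphemes(s));
        }
        // Name of the second character when addressed per grapheme.
        String second = substrGraphemes(deseret, 1, 1);
        System.out.println(Character.getName(second.codePointAt(0)));
    }
}
```

For the two strings from the examples, the code-unit counts (2 and 4) match what rakudo-j prints, while the grapheme counts (1 and 2) match rakudo-m. The code-point counts (2 and 2) match neither backend for both strings.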
The problem is even mentioned in Rakudo's documentation on routine chars:

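Please note that on the JVM, you currently get codepoints instead of graphemes.

Strictly speaking, the second example above shows UTF-16 code units, not codepoints: two codepoints outside the BMP are reported as 4 chars, and substr returns lone surrogates.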

I'm not sure whether this can be solved without fully supporting NFG, i.e. Normalization Form Grapheme (https://github.com/Raku/nqp/issues/241). In any case, I want to use this issue as a reference for fudged tests.