4.05-beta: c2x and c2d in non-oo notation is not compatible with classic rexx

jlfaucher commented 1 year ago

c2x

It seems the rule "Insignificant leading zeros are removed" is applied to every character, but in this mode all the bytes should be kept.

say c2x("René") -- 52656EE9, but the bytes are 0052 0065 006E 00E9

Regina and ooRexx:
say c2x("0052 0065 006E 00E9"x) -- 00520065006E00E9

c2d

say c2d("a") -- 97, as in Regina and ooRexx
say c2d("aa") -- 9797

Regina and ooRexx:
say c2d("0061 0061"x) -- 6357089 (65536*97 + 97)
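
For reference, here is a minimal Java sketch of the byte-oriented behaviour that Regina and ooRexx show above: c2x emits two hex digits per byte, and c2d reads the whole byte string as one unsigned big-endian number. The class and method names are hypothetical and are not taken from the NetRexx sources. The `long` accumulator is a simplification that only holds up for short inputs.

```java
final class ClassicByteBifs {
    // Two hex digits per byte; leading zeros are kept.
    static String c2x(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The whole byte string is read as one unsigned big-endian number.
    // Simplification: a long overflows once the input exceeds 7 bytes.
    static long c2d(byte[] data) {
        long value = 0;
        for (byte b : data) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] rene = {0x00, 0x52, 0x00, 0x65, 0x00, 0x6E, 0x00, (byte) 0xE9};
        byte[] aa = {0x00, 0x61, 0x00, 0x61};
        System.out.println(c2x(rene)); // 00520065006E00E9
        System.out.println(c2d(aa));   // 6357089 = 65536*97 + 97
    }
}
```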

rvjansen commented 1 year ago

C2X

This is caused by the fact that in NetRexx say "R".c2x yields 52, and the current implementation does that in a loop. But:

ooRexx: 6 - say c2x("René")

"52656EC3A9"

So, modulo the Unicode representation, this seems the same (in fact, say c2x('Rene') yields 52656E65 in both NetRexx and ooRexx). And amazingly (not), this is what I tested as the first test case, and I was satisfied that they gave the same answer.

Sending a hex string into c2x is hard in NetRexx: the 'FFFF'x notation is not available, and although I can write 0xFFFF, that is not accepted. I never really saw the point of this, although I am struggling with the same thing in cReXx.

I am not sure exactly what you mean by "the bytes are" 0052 etc. If I take my emacs and type in R, it saves 52 and not 0052 as the first byte; this is because it and my shell (iTerm2) are set to UTF-8. In the JVM I can dump it, but then it depends on whether I am on a version higher than Java 11, where the strings are compressed in most cases.

I would, however, like to get to the bottom of this, so thank you very much for this report.

rvjansen commented 1 year ago

C2D

This is way worse; here the implementations give different answers, and it is fair to say that I missed the mark.

NetRexx: 7 = say c2d('aa')

"aa" "9797"

ooRexx: 7 - say c2d('aa')

"24929"

and z/VM: say c2d('aa')
33153
Ready; T=0.01/0.01 09:52:33
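
The three answers above can be reproduced with a little arithmetic. The Java sketch below uses hypothetical names and assumes that ooRexx sees 'a' as the ASCII byte 0x61 and z/VM sees it as the EBCDIC byte 0x81. Both then read the two bytes as one big-endian number, while the NetRexx result looks like the decimal value of each character concatenated.

```java
import java.math.BigInteger;

final class C2dAcrossEncodings {
    // Classic behaviour: the byte string is one unsigned big-endian number.
    static BigInteger c2dBytes(byte[] data) {
        return new BigInteger(1, data);
    }

    // What the reported NetRexx output looks like: per-character decimal
    // values, concatenated as text.
    static String c2dPerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] asciiAa = {0x61, 0x61};                // 'a' is 0x61 in ASCII
        byte[] ebcdicAa = {(byte) 0x81, (byte) 0x81}; // 'a' is 0x81 in EBCDIC
        System.out.println(c2dBytes(asciiAa));  // 24929 = 97*256 + 97
        System.out.println(c2dBytes(ebcdicAa)); // 33153 = 129*256 + 129
        System.out.println(c2dPerChar("aa"));   // 9797
    }
}
```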

Thank you for bringing this up, I have a learning opportunity here.

rvjansen commented 1 year ago

I checked in a fix for c2d() and am testing.

jlfaucher commented 1 year ago

Hello René

Yes, c2x is working fine, my mistake. I made a wrong assumption, thinking the internal encoding is UTF-16BE. I'm still not clear about the internal encoding, but I guess it depends on the Java version and maybe on options. I use Java 19 under macOS. With the code below, I could not find out how to properly display a UTF-8 string (the string s3).

s0 = "René"
say s0                                      -- René
say s0.length                               -- 4
say c2x(s0)                                 -- 52656EE9

-- René, encoded UTF16-BE
s1 = "\x00\x52\x00\x65\x00\x6E\x00\xE9"
say s1                                      -- René
say s1.length                               -- 8
say c2x(s1)                                 -- 05206506E0E9

-- René, encoded what? could be Unicode codepoint 8-bit
s2 = "\x52\x65\x6E\xE9"
say s2                                      -- René
say s2.length                               -- 4
say c2x(s2)                                 -- 52656EE9

-- René, encoded UTF-8
s3 = "\x52\x65\x6E\xC3\xA9"
say s3                                      -- RenÃ©
say s3.length                               -- 5
say c2x(s3)                                 -- 52656EC3A9

jlfaucher commented 1 year ago

After reading again, I think c2x(s1) is not good. Since length is 8, I can assume that the internal byte sequence is 00 52 00 65 00 6E 00 E9, right?

c2x                     05206506E0E9
c2x with space          05 20 65 06 E0 E9
c2x aligned with bytes   0 52  0 65  0 6E  0 E9
bytes                   00 52 00 65 00 6E 00 E9

Another example where I think c2x should be 0D0A:

-- \r\n
s4 = "\r\n"
say "\\r\\n"                                -- \r\n
say s4.length                               -- 2
say c2x(s4)                                 -- DA

-- \r\n
s5 = "\x0D\x0A"
say "\\x0D\\x0A"                            -- \x0D\x0A
say s5.length                               --  2
say c2x(s5)                                 -- DA
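
The outputs reported above are consistent with a per-character conversion that drops leading zeros. A minimal Java sketch of that behaviour follows, with a variant that pads each character to two hex digits. It is an assumption about the mechanism, with hypothetical names, and is not the actual NetRexx implementation.

```java
final class C2xPadding {
    // Per-character hex with leading zeros dropped (the reported behaviour).
    static String c2xSuppressed(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(Integer.toHexString(s.charAt(i)).toUpperCase());
        }
        return sb.toString();
    }

    // At least two hex digits per character; characters above 0xFF
    // still produce more than two digits.
    static String c2xTwoDigits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%02X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same eight characters as s1 above: 00 52 00 65 00 6E 00 E9
        String s1 = new String(new char[] {0, 0x52, 0, 0x65, 0, 0x6E, 0, 0xE9});
        System.out.println(c2xSuppressed(s1));     // 05206506E0E9
        System.out.println(c2xTwoDigits(s1));      // 00520065006E00E9
        System.out.println(c2xSuppressed("\r\n")); // DA
        System.out.println(c2xTwoDigits("\r\n"));  // 0D0A
    }
}
```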

ronyfla commented 1 year ago

Maybe interesting in this particular context:

"JEP-400: UTF-8 by Default" (https://openjdk.org/jeps/400) was delivered with Java 18, released in March 2022 (maybe interesting in this context as well: https://en.wikipedia.org/wiki/Character_encodings_in_HTML)
Oracle in Java 8 (chapter "Text Representation"): https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html
Modified UTF-8: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

rvjansen commented 1 year ago

Hi Jean-Louis, I agree about 0D0A; the c2x documentation states that the result string should always be twice the size of the input string. With NetRexx the prefix zero is suppressed, but I will unsuppress it.

The other problem is how we should interpret the functioning of c2x in a Unicode world. The thing I strive for is compatibility with the 8-bit char versions of Rexx (this particular instance of c2x has Classic Rexx compatibility as its main point of existence). So I will initially go for something that gives the same results as Regina and ooRexx, except for wider characters. We'll have to see what that looks like. It is one of the issues for the ARB.

jlfaucher commented 1 year ago

Hello René

OK for the ARB discussion.

NetRexx (Java) is challenging because it doesn't work at the 8-bit level, but at the 16-bit level. This is independent of any internal Java optimization or Java default encoding. The current encoding has an impact when getting the bytes with str.toByteArray(), not when getting the characters with str.toCharArray() or with str.charAt(i).

I wrote a NetRexx script to (try to) understand what happens. I will put it in the ARB repository, if I can.

I'm not familiar with NetRexx or Java, so I could be wrong somewhere... The first error I made was to assume that the escape \xhh creates an 8-bit character. In fact, it is always a 16-bit character. \uhhhh is more obvious: it is also a 16-bit character.

The second error was to think that str.toByteArray() returns the internal representation of the string. In fact, it's an encoding of the internal representation using the default encoding. The internal representation is not accessible, so it's different from Regina and ooRexx.
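
The difference between characters and encoded bytes can be seen with a small Java sketch. The names are hypothetical, and it uses plain java.lang.String rather than the NetRexx Rexx class. The 16-bit code units stay the same whatever the encoding; only the byte view changes with the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsVersusBytes {
    // Two hex digits per byte.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Four hex digits per 16-bit code unit.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Ren\u00E9"; // escape avoids source-encoding surprises
        System.out.println(s.length());                                  // 4
        System.out.println(codeUnits(s));                                // 00520065006E00E9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // 52656EC3A9
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 52656EE9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00520065006E00E9
        // getBytes() without an argument uses this charset; it varies by platform and Java version.
        System.out.println(Charset.defaultCharset());
    }
}
```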

The script shows that c2x(str) should always use 4 hex digits per character. You will see that in the output of examples 11 and 12. Both have the same result for c2x, but they should be different.
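
Examples 11 and 12 from the script are not reproduced in this thread, so the pair of strings in the Java sketch below is a hypothetical stand-in for them. It shows how a variable-width c2x can map two different strings to the same result, while a fixed width of four hex digits per 16-bit character keeps them apart.

```java
final class FixedWidthC2x {
    // At least two hex digits per character, more for wide characters.
    static String c2xMinTwo(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%02X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Always four hex digits per 16-bit character.
    static String c2xFixedFour(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoChars = "\u0012\u0034"; // two narrow characters
        String oneChar = "\u1234";        // one wide character
        System.out.println(c2xMinTwo(twoChars));    // 1234
        System.out.println(c2xMinTwo(oneChar));     // 1234 (same as above)
        System.out.println(c2xFixedFour(twoChars)); // 00120034
        System.out.println(c2xFixedFour(oneChar));  // 1234
    }
}
```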

rvjansen commented 1 year ago

The problem here is that there are usage patterns where it is assumed that c2x() works on 8-bit bytes, like (coincidentally today) a discussion on the mainframe list where someone unpacks a field in DCOLLECT output using C2X; the same goes for translate and the other BIFs we discussed on the ARB list. We probably need a (limited) set of these cases where implementations offer an 8-bit byte version of a function and one that handles Unicode correctly. (Representations of Unicode characters can also be 3 or 4 bytes.) I would suggest wc2x, wc2c etc. Unfortunately, in C uchar means unsigned char; otherwise I would have preferred uc2d, uc2x (or we can brave it and do it anyway, because 'unsigned char' in itself is an abomination).

RexxLA / NetRexx

4.05-beta: c2x and c2d in non-oo notation is not compatible with classic rexx #50