Incorrect emission of MUTF-8

toddATavail commented 7 years ago

Given the instruction

ldc_w "St. Mary's Hospital — Janesville"

assemble.py produces the following constant pool entry

00000280  xx xx xx xx xx xx xx xx  xx xx xx xx xx xx 01 25  |...............%|
00000290  53 74 2e 20 4d 61 72 79  27 73 20 48 6f 73 70 69  |St. Mary's Hospi|
000002a0  74 61 6c 20 c3 a2 c2 80  c2 94 20 4a 61 6e 65 73  |tal ...... Janes|
000002b0  76 69 6c 6c 65 xx xx xx  xx xx xx xx xx xx xx xx  |ville...........|

as revealed by doing a hexdump -C of the resulting class file. (I have elided bytes outside of this case with x's in the hex dump and periods in the ASCII.)

Based on 4.4.7 of the JVM Specification, the 6-byte encoding should not be applied to characters in the BMP. The text of 4.4.7 says:

Code points in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z

The character that is being improperly encoded here is em-dash — (U+2014). The incorrect bytes here are c3 a2 c2 80 c2 94, and the expected bytes are e2 80 94. Therefore, all else being equal, the expected bytes for the entire constant pool entry should be:

00000280  xx xx xx xx xx xx xx xx  xx xx xx xx xx xx 01 25  |...............%|
00000290  53 74 2e 20 4d 61 72 79  27 73 20 48 6f 73 70 69  |St. Mary's Hospi|
000002a0  74 61 6c 20 e2 80 94 20  4a 61 6e 65 73 76 69 6c  |tal ... Janesvil|
000002b0  6c 65 xx xx xx xx xx xx  xx xx xx xx xx xx xx xx  |le..............|

Do you have time to investigate this? I have a product release deadline coming up on Wednesday February 8. If you don't have time write now to dig into it, then I can probably knock it out myself and send you a pull request.

Thanks!

Storyyeller commented 7 years ago

Just based on the bytes listed, it looks like there is a double encoding issue. The em dash was encoded into three bytes, and then each of those bytes was UTF8 encoded again, turning into two bytes.

toddATavail commented 7 years ago

Thank you! Much appreciated. I'll hold tight for now, and switch focus to something else that I need to build for a few days.

Storyyeller commented 7 years ago

It turns out that the issue is that the assembler is not designed to handle raw unicode in the input file. You're supposed to use escape sequences.

assembler.txt says

Non printable and non-ascii characters, including tabs, are not allowed. These can be represented by escape sequences as usual. For example \t means tab.

It might be worth adding a warning to avoid confusion in the future though.

Storyyeller commented 7 years ago

By the way, what are you using Krakatau for? If you provide more information, I might be able to give some tips.

toddATavail commented 7 years ago

Okay, I was confused because assembler.txt also says:

* There are two types of string literals - bytes and unicode. Unicode strings are the default and represent a sequence of code points which will be MUTF8 encoded when written to the classfile. A byte string, represented by prefixing with b or B, represents a raw sequence of bytes which will be written unchanged. For example, "\0" is encoded to a two byte sequence while b"\0" puts an actual null byte in the classfile (which is invalid, but potentially useful for testing).

The language here suggests (to me, at least) that Unicode code points are acceptable within string literals, which is the only place that I'm putting them.

So I need to generate escape sequences instead? I can do that if necessary.

toddATavail commented 7 years ago

I'm using Krakatau as the target of a compiler for a high-performance dynamic rule engine. I designed the application and the specification language, and then wrote the compiler, rule engine, server, and UI from scratch. The classes that I'm generating are concrete subclasses of interface hierarchies that are on-demand loaded by the rule engine.

toddATavail commented 7 years ago

Okay, I just verified that the specific character that I encountered a problem with can indeed be escaped as \u2014 in the string literal, and the generated code is correct. I'll just escape the appropriate characters during the final prep before emission. Thanks for your help.

Storyyeller commented 7 years ago

Sorry, the assembler documentation is confusing. What I meant is that unicode string literals can include escaped unicode, (and the result will be MUTF8 encoded).

P.S. What language did you write your compiler in? Java?

Storyyeller commented 7 years ago

By the way, I'm considering rewriting the assembler in Rust, and I was wondering if this would cause a problem for you.

toddATavail commented 7 years ago

Rust produces standalone binaries, so that should actually be better for me in the long run — I always have to update the bang path whenever I pull a new Krakatau from GitHub so that it will find the right Python on my system. In the short term, there will be new bugs that are the inevitable consequences of starting a new project or porting an old one, so I would probably stay away from the new version for a while until things calm down and firm up.

Krakatau is pretty stable right now, and you answer questions and turn around issues quickly, so I say you should go for it if you want. I presume that you would still provide some support if people find critical bugs in the soon-to-be legacy Python implementation?

I've written a few small projects in Rust. The borrow checker takes some adjustment, but the whole experience feels less fraught than C++. Wish I had the time to volunteer to help you with the port — sounds like a fun project. :)

Storyyeller commented 7 years ago

I'm not really sure what I would do with the Python version if I did rewrite in Rust. One possibility would be to declare it unmaintened and do future development on the Rust version, but I wouldn't want to leave people hanging if there's some reason that they can only use Python and not Rust. What would be ideal is some way to run Rust code in Python, an interpeter/emulator perhaps.

Storyyeller commented 7 years ago

Hi, I'm contacting you because you have shown interest in the Krakatau assembler in the past. I am currently investigating adding Java 9 support, so I was wondering if you have any opinions or thoughts about syntax for the new features (mostly the Module attribute as the other additions are fairly trivial).

Storyyeller / Krakatau

Incorrect emission of MUTF-8 #113