x4e closed this issue 3 years ago
Hey, thanks for the report. I've just started digging into it, and a couple of things stand out right away.
First: javap can't even decode the strings correctly.
That to me is a big flag that whatever the strings contain, it isn't UTF-8.
Second: If this class was generated by javac, then this wouldn't have flown unless the literal itself was embedded with these weird characters.
One of my thoughts is that it was generated by something other than the Java compiler.
Third: Upon looking at some of the bad bytes, it looks like it's being detected as a lone surrogate (one half of a surrogate pair, without its partner), which, of course, is invalid UTF-8.
I'm still looking into it, but my suspicion is that whatever generated it didn't generate it correctly according to the JVM class file spec.
Although, I could have very much forgotten something with surrogates. As I said, I'm still diving into it, and I'll keep you updated.
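For anyone following along, here is a minimal sketch of what "lone surrogate" means in practice. It assumes the string has already been turned into UTF-16 code units, and `unpaired_surrogates` is a hypothetical helper name, not something from the mutf8 crate. It relies only on the standard library's `std::char::decode_utf16`.

```rust
/// Hypothetical helper: return every unpaired surrogate found in a
/// sequence of UTF-16 code units. A high surrogate (D800..=DBFF) must be
/// followed by a low surrogate (DC00..=DFFF); anything else is unpaired.
fn unpaired_surrogates(units: &[u16]) -> Vec<u16> {
    std::char::decode_utf16(units.iter().copied())
        // Keep only the errors, and pull out the offending code unit.
        .filter_map(|result| result.err().map(|e| e.unpaired_surrogate()))
        .collect()
}
```

With this, `unpaired_surrogates(&[0x0041, 0xD8A2, 0x0042])` reports `0xD8A2`, while a well-formed pair such as `[0xD83D, 0xDE00]` reports nothing.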
Ok, going deeper seems to confirm that whatever it is, it isn't UTF-8.
My guess would be that the thing that's generating it doesn't generate valid UTF-8 sequences.
Because it's obviously an encoder of some kind, I would assume it's some form of decision table encoded as a string. Whatever generated it forgot that it needs to keep the code points out of the surrogate range (D800–DFFF), as those are not valid in UTF-8.
The first sequence of bytes that breaks it is 0xEDA2A2. Decoding it, you get 0xD8A2, and that's in the reserved surrogate space.
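The arithmetic behind that can be checked by hand. The sketch below is only an illustration of the three-byte bit layout (1110xxxx 10xxxxxx 10xxxxxx); `decode_three_byte` and `is_surrogate` are made-up names for this example and are not how the mutf8 crate is implemented.

```rust
/// Decode one three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) into its
/// 16-bit value, deliberately WITHOUT rejecting surrogates.
/// Returns None if the bytes do not have the three-byte bit pattern.
fn decode_three_byte(b: [u8; 3]) -> Option<u16> {
    let well_formed =
        b[0] & 0xF0 == 0xE0 && b[1] & 0xC0 == 0x80 && b[2] & 0xC0 == 0x80;
    if !well_formed {
        return None;
    }
    // 4 payload bits from the lead byte, 6 from each continuation byte.
    Some(((b[0] as u16 & 0x0F) << 12) | ((b[1] as u16 & 0x3F) << 6) | (b[2] as u16 & 0x3F))
}

/// True if the value falls in the surrogate range D800..=DFFF.
fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}
```

Here `decode_three_byte([0xED, 0xA2, 0xA2])` yields `Some(0xD8A2)`: the lead byte contributes 0xD, and each 0xA2 contributes 0x22. `is_surrogate(0xD8A2)` is true, which is why the standard library's strict `std::str::from_utf8` refuses the same three bytes.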
So, the mutf8 crate is fine. It's decoding the mutf8 sequence without issue. The problem, unfortunately, lies in the content itself.
Let me know if there's anything else I can do.
Ok thanks for the help! Weird how this class is included in the jdk8 runtime... It does seem to be generated by javac however (http://www.docjar.com/html/api/sun/awt/motif/X11GB18030_0$Encoder.java.html).
Anyway, for now I guess I will do the same as javac and do a lossy parse. Thanks again for the help.
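In case it helps anyone who lands here later, a lossy parse could look roughly like the sketch below. This is an assumption about what "lossy" means here: it simply uses the standard library's `String::from_utf8_lossy`, which substitutes U+FFFD for anything that is not valid UTF-8. `parse_lossy` is a hypothetical name, and the sketch does not handle the other modified UTF-8 quirks (such as the two-byte encoding of NUL).

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode constant-pool string bytes, replacing any
/// invalid UTF-8 (including encoded lone surrogates) with U+FFFD instead
/// of returning an error. Valid input is borrowed without copying.
fn parse_lossy(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}
```

For example, `parse_lossy(b"a\xED\xA2\xA2b")` keeps the `a` and `b` and puts replacement characters where the bad bytes were, so the original surrogate value is lost.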
Parsing the utf8 strings contained in the class file in the following zip causes malformed utf8 output (constant pool entry #53): X11GB18030_0$Encoder.zip
Code:
Error returned from String::from_utf8:
Raw string bytes:
Decoded utf8 bytes: