Open armijnhemel opened 2 years ago
@armijnhemel Thanks. @TkTech's code is awesome. I have been using Jawa elsewhere (now https://github.com/TkTech/Lawu ) and we should switch to using it exclusively and contribute fixes upstream.
IMHO Lawu would best replace https://github.com/nexB/scancode-plugins/blob/main/binary-analysis/scancode-compiledcode/src/compiledcode/javaclass/javaclass.py , which does not even support the most recent updates to the Java class file format.
If you run into bugs with mutf8, just let me know. Happy to debug.
For Lawu, I strongly suggest keeping what you have for a couple of weeks, as there's going to be a major version bump with major breaking changes when the long-running develop branch becomes master.
Description
The parser for Java byte code is incorrect: it treats strings (including method names, class names, and so on) as UTF-8, but according to the Java class file specification these are MUTF-8 (modified UTF-8):
https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4.7
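For most strings the two encodings are byte-for-byte identical, but they differ for the null character (U+0000, which MUTF-8 encodes as the two bytes C0 80) and for supplementary characters outside the Basic Multilingual Plane (which MUTF-8 encodes as a surrogate pair of two three-byte sequences instead of one four-byte sequence). You will run into the latter once you start getting non-Western strings (example: rarer CJK characters). If strings are important to you, then you should add another step that translates the strings. For this I am using the mutf8 package https://github.com/TkTech/mutf8/ although I have a feeling that there are still a few bugs in that package (that I need to chase).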
MUTF-8 is also used in Android byte code.
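
To make the difference concrete, here is a minimal pure-Python sketch of MUTF-8 decoding and encoding. It is not the code of the mutf8 package or of scancode. The function names `decode_mutf8` and `encode_mutf8` are made up for illustration, and the sketch assumes well-formed input.

```python
def decode_mutf8(data: bytes) -> str:
    """Decode JVM modified UTF-8 bytes into a str (illustrative sketch)."""
    # MUTF-8 stores U+0000 as C0 80. In well-formed input the byte C0 only
    # ever appears as the lead byte of that pair, so a plain replace is safe.
    data = data.replace(b"\xc0\x80", b"\x00")
    # Supplementary characters are stored as two 3-byte surrogate encodings.
    # "surrogatepass" lets the UTF-8 codec accept them as lone surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one character.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "surrogatepass")


def encode_mutf8(text: str) -> bytes:
    """Encode a str as JVM modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    # Walk the string one UTF-16 code unit at a time, as the JVM does.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # null gets the two-byte form
        else:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+20000 is a CJK Extension B character outside the BMP.
sample = b"\xed\xa1\x80\xed\xb0\x80"      # MUTF-8 form: 6 bytes
assert decode_mutf8(sample) == "\U00020000"
assert "\U00020000".encode("utf-8") == b"\xf0\xa0\x80\x80"  # UTF-8 form: 4 bytes
```

Feeding `sample` to a strict `bytes.decode("utf-8")` raises `UnicodeDecodeError`, which is how this bug tends to show up in a parser that assumes plain UTF-8.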