aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

compiledcode: java class strings are not utf-8 but mutf-8 #3145

Open armijnhemel opened 2 years ago

armijnhemel commented 2 years ago

Description

The parser for Java byte code is incorrect: it treats strings (including method names, class names, and so on) as UTF-8, but according to the Java class file specification these are MUTF-8:

https://docs.oracle.com/javase/specs/jvms/se12/html/jvms-4.html#jvms-4.4.7

For most strings the two encodings are interchangeable, but they differ for the null character and for characters outside the Basic Multilingual Plane (for example, rarer CJK ideographs and emoji). If strings are important to you, then you should add another step that translates the strings. For this I am using the mutf8 package https://github.com/TkTech/mutf8/ although I have a feeling that there are still a few bugs in that package (that I need to chase).

MUTF-8 is also used in Android byte code.
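
To illustrate the difference, here is a minimal standard-library sketch of MUTF-8 conversion. It is not the code used by ScanCode or by the mutf8 package; the function names `decode_mutf8` and `encode_mutf8` are hypothetical, and the sketch assumes well-formed input with no error handling. It relies on two facts from the class file specification: U+0000 is stored as the two bytes `C0 80`, and supplementary characters are stored as two three-byte encoded surrogates.

```python
def decode_mutf8(data: bytes) -> str:
    """Decode well-formed MUTF-8 bytes into a Python str (illustrative sketch)."""
    # 0xC0 is only ever a lead byte, so C0 80 can only be the encoded null.
    plain = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" lets the UTF-8 codec accept the encoded surrogate halves.
    halves = plain.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one character.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def encode_mutf8(text: str) -> bytes:
    """Encode a Python str as MUTF-8 bytes (illustrative sketch)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be", "surrogatepass")
    # Encode every 16-bit UTF-16 code unit on its own, so surrogates
    # become separate three-byte sequences.
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # null is never stored as a single zero byte
        else:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# A BMP character encodes identically in UTF-8 and MUTF-8.
same = encode_mutf8("\u4e2d") == "\u4e2d".encode("utf-8")
# A supplementary character (U+1F600) takes six bytes instead of four.
emoji = encode_mutf8("\U0001F600")
```

With this sketch, a strict UTF-8 decoder would reject the six-byte form held in `emoji`, which is why treating class file strings as plain UTF-8 can fail or produce wrong text.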

pombredanne commented 2 years ago

@armijnhemel Thanks. @TkTech's code is awesome. I have been using Jawa elsewhere (now https://github.com/TkTech/Lawu ) and we should likely switch to using it exclusively and contribute fixes upstream.

pombredanne commented 2 years ago

IMHO Lawu would best replace https://github.com/nexB/scancode-plugins/blob/main/binary-analysis/scancode-compiledcode/src/compiledcode/javaclass/javaclass.py which does not even support the most recent updates to the Java class file format.

TkTech commented 2 years ago

If you run into bugs with mutf8, just let me know. Happy to debug.

For Lawu, I strongly suggest keeping what you have for a couple of weeks, as there's going to be a major version bump with major breaking changes when the long-running develop branch becomes master.