aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB, and other generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0
115 stars 85 forks

Android APK D2D: Match converted JARs to PurlDB #1371

Closed by pombredanne 1 month ago

pombredanne commented 2 months ago

Once classes.dex files are converted back to Java JARs, we need to match these JARs to the PurlDB. This is a special case because Android bytecode that is extracted and converted back to a Java .class file or a JAR will not be exactly the same as the original Java bytecode it was derived from. We need to create special techniques for this.

JonoYang commented 1 month ago

The PurlDB directory matching step would be useful here. jadx decompiles dex files to a directory adjacent to the dex file, where the classes in the dex files are converted to Java source. We can fingerprint those directories and match them against the PurlDB. One caveat I can think of: we would match to the source distribution of a Java package rather than the binary, since the fingerprinted directories from the dex files would contain .java files instead of .class files.

mjherzog commented 1 month ago

From recent Android project experience, my understanding of JADX is that (1) it will produce a set of .class files and a set of .source files neither of which will be a fingerprint match to the original Java code and (2) the decompiled source will be more divergent from the original source than the binaries. So wouldn't fuzzy matching for the .class files be the most important first step?

JonoYang commented 1 month ago

@mjherzog

I was thinking in terms of using the directory structure fingerprints, where the fingerprints are created from the paths of the resources within a directory. We would not match anything at the binary level, since the decompilation output could vary.
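
To make the idea concrete, here is a minimal sketch of a path-based directory fingerprint. It is not the PurlDB implementation: the function names, the `ignore_extensions` flag, and the use of SHA-1 over the sorted paths are all assumptions made for illustration. An exact hash like this only matches identical path sets, so the sketch also includes a plain Jaccard similarity over path sets as a stand-in for whatever fuzzy comparison a real matcher would use.

```python
import hashlib
from pathlib import PurePosixPath


def directory_structure_fingerprint(paths, ignore_extensions=False):
    """Return a hex digest computed only from resource paths, never from file content.

    Hypothetical helper: `ignore_extensions` drops the final suffix so that
    `Foo.java` (decompiled source) and `Foo.class` (binary) map to the same name.
    """
    names = set()
    for path in paths:
        p = PurePosixPath(path)
        if ignore_extensions:
            p = p.with_suffix("")
        names.add(str(p))
    # Sort so the fingerprint does not depend on directory traversal order.
    joined = "\n".join(sorted(names))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def path_similarity(paths_a, paths_b):
    """Jaccard similarity between two sets of paths, from 0.0 to 1.0."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Paths relative to the directory being fingerprinted (made-up example data).
decompiled = ["com/example/Foo.java", "com/example/util/Bar.java"]
binary = ["com/example/util/Bar.class", "com/example/Foo.class"]

fp_source = directory_structure_fingerprint(decompiled, ignore_extensions=True)
fp_binary = directory_structure_fingerprint(binary, ignore_extensions=True)
print(fp_source == fp_binary)
```

Because only the paths are hashed, differences in the decompiled file contents do not affect the result. With extensions kept, the same two listings produce different fingerprints, which is the source-versus-binary caveat mentioned earlier in the thread.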

mjherzog commented 1 month ago

That makes sense - I had missed that point.

chinyeungli commented 1 month ago

The directory structure fingerprints make sense.

JonoYang commented 1 month ago

There is a step in the Android D2D pipeline for matching directories to packages on PurlDB.

pombredanne commented 1 month ago

Thanks for completing this! Some specific issues wrt. Kotlin are tracked in:

pombredanne commented 1 month ago

As a follow up there are several refinements we can implement. These are tracked in: