Closed: pombredanne closed this 1 month ago
The purldb directory matching step would be useful here. jadx decompiles dex files to a directory adjacent to the dex file, where the classes in the dex files are converted to Java source. We can fingerprint those directories and match them against the purldb. One caveat I can think of: we would match to the source distribution of a Java package rather than the binary, since the fingerprinted directories from the dex files would contain .java files instead of .class files.
From recent Android project experience, my understanding of JADX is that (1) it will produce a set of .class files and a set of .source files neither of which will be a fingerprint match to the original Java code and (2) the decompiled source will be more divergent from the original source than the binaries. So wouldn't fuzzy matching for the .class files be the most important first step?
@mjherzog
I was thinking in terms of using the directory structure fingerprints, which are created from the paths of the resources within a directory. We would not match anything at the binary level, since the decompilation output could vary.
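To make the idea concrete, here is a minimal sketch of a path-only fingerprint. It is an illustration, not purldb's actual fingerprinting code: the function name `structure_fingerprint`, the `ignore_suffix` option, and the use of SHA-1 over sorted relative paths are all assumptions for the example. It is also an exact hash, so any extra or missing file changes the result, whereas a real matcher would probably need to tolerate small differences.

```python
import hashlib
from pathlib import Path, PurePosixPath


def structure_fingerprint(root, ignore_suffix=True):
    """Return a hex digest computed only from the relative paths under root.

    Hypothetical helper: file contents are never read, so decompiled
    output that differs from the original code does not affect the result.
    """
    root = Path(root)
    names = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = PurePosixPath(path.relative_to(root).as_posix())
        if ignore_suffix:
            # Assumption: drop the extension so Foo.java and Foo.class
            # contribute the same name to the fingerprint.
            rel = rel.with_suffix("")
        names.append(str(rel))

    digest = hashlib.sha1()
    # Sort so the result does not depend on filesystem traversal order.
    for name in sorted(names):
        digest.update(name.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With `ignore_suffix=True`, a directory of decompiled .java files and a directory of .class files with the same package layout produce the same digest, which speaks to the source-versus-binary caveat raised above. With `ignore_suffix=False` they differ.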
That makes sense - I had missed that point.
The directory structure fingerprints make sense.
There is a step in the Android d2d pipeline for matching directories to packages on purldb.
Thanks for completing this! Some specific issues wrt. Kotlin are tracked in:
As a follow up there are several refinements we can implement. These are tracked in:
Once classes.dex has been converted back to a Java JAR, we need to match the result to the PurlDB. This is a special case because Android bytecode that is extracted and converted back to Java .class files or a JAR will not be exactly the same as the original Java bytecode it was derived from. We need to create special techniques for this.