aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis pipelines with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB, and other generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0

Android APK D2D: Convert classes.dex and related VM/Dalvik bytecode to plain JARs #1372

Closed pombredanne closed 1 month ago

pombredanne commented 2 months ago

The goal is to reverse the Android bytecode into Java class files and JARs amenable to further analysis.

There are multiple tools that may help with this:

JonoYang commented 1 month ago

I am able to use jadx to decompile the classes.dex file into Java source files and then map their paths to the "from" codebase. However, we need to make jadx available to users when they install scancode.io. We can do this by creating a new plugin that vendors jadx.

JonoYang commented 1 month ago

We should also have a way to separate the Kotlin runtime from the results.

JonoYang commented 1 month ago

An Android APK d2d pipeline has been created here: https://github.com/aboutcode-org/android-inspector

All Android d2d-related code for scancode.io will be placed there.

chinyeungli commented 1 month ago

This is a short sample d2d note from compiling an APK from Kotlin sources.

I created an APK in a sample Android project that has 7 source files (excluding the XML files from res/): 6 .kt files and 1 .java file.

If I extract the APK, I find 5 classes.dex files, making it unclear which dex files contain the 'real' sources and which ones contain Google libraries.

On the other hand, if I use JADX directly on the APK file, it generates the file structure without my having to deal with the individual dex files.

This note indicates that we can use JADX directly on the APK file rather than the DEX file.
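
A minimal sketch of calling jadx on the APK from Python, assuming a `jadx` executable is on PATH and using its `-d` (output directory) option. The function names `build_jadx_command` and `decompile` are hypothetical and are not taken from android-inspector.

```python
import shutil
import subprocess


def build_jadx_command(input_path, output_dir, jadx_bin="jadx"):
    """Return the argv list to decompile an APK (or a .dex file) with jadx."""
    # "-d" sets the output directory; the input file is the last argument.
    return [jadx_bin, "-d", str(output_dir), str(input_path)]


def decompile(input_path, output_dir, jadx_bin="jadx"):
    """Run jadx and return its exit code (assumes jadx is installed)."""
    if shutil.which(jadx_bin) is None:
        raise FileNotFoundError(f"{jadx_bin} not found on PATH")
    command = build_jadx_command(input_path, output_dir, jadx_bin)
    # No check=True here: the caller decides how to handle a non-zero exit.
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode
```

Passing the APK path instead of a single classes.dex lets jadx handle all the dex files in one run, which matches the observation above.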

However, in either case, it generated many more files than we originally had in the sources. It generated the following directories:

_COROUTINE/
android/support/v4/
androidx/
com/example/ <-- source location
com/google/common/
kotlin/
kotlinx/
org/intellij/
org/jetbrains/

As noted above, the sources should only be under com/example/, but jadx also generates many other libraries that we need to identify and ignore when we perform the D2D, since they will have no match in the sources.
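
One possible way to ignore these libraries is a package-prefix filter applied to the decompiled paths. This is only a sketch: the prefix list is a hypothetical starting point based on the directories listed above (with the standard package spellings assumed), the paths are assumed to be relative to the root of the decompiled sources, and `is_library_path` is not an existing scancode.io function.

```python
from pathlib import PurePosixPath

# Hypothetical ignore list derived from the directories seen in this sample.
LIBRARY_PREFIXES = (
    "_COROUTINE",
    "android/support",
    "androidx",
    "com/google/common",
    "kotlin",
    "kotlinx",
    "org/intellij",
    "org/jetbrains",
)


def is_library_path(path, prefixes=LIBRARY_PREFIXES):
    """Return True if path sits under one of the known library packages."""
    parts = PurePosixPath(path).parts
    for prefix in prefixes:
        prefix_parts = PurePosixPath(prefix).parts
        # Compare whole path segments so "kotlin" does not match "kotlinfoo".
        if parts[: len(prefix_parts)] == prefix_parts:
            return True
    return False
```

With this list, `com/example/MainActivity.java` is kept while `kotlin/Unit.java` is skipped. A fixed list will miss libraries bundled under other packages, so it would need to be extended or replaced by package detection.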

At the file level, the following are the source files in the sample Android project:

empty_java.java
empty_kt.kt
MainActivity.kt
test_kt.kt

The following are the decompiled files generated by jadx:

ComposableSingletons$MainActivityKt.java
empty_java.java
empty_kotlin.java
MainActivity.java
MainActivityKt.java
R.java
test_kotlin.java
Test_ktKt.java

We already know that R.java is generated, but for the others, it is important to note that not all of them have matching basenames. These are the matches:

Source: empty_java.java JADX: empty_java.java

Source: empty_kt.kt JADX: empty_kotlin.java

Source: MainActivity.kt JADX: MainActivity.java, MainActivityKt.java

Source: test_kt.kt JADX: test_kotlin.java, Test_ktKt.java

We need to find a pattern in order to increase the matching accuracy and avoid noise.
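
Part of the pattern follows from how Kotlin names the class that holds a file's top-level declarations: the file name with its first letter upper-cased and `Kt` appended, which is consistent with `MainActivityKt.java` and `Test_ktKt.java` above. The sketch below generates candidate basenames on that assumption. `candidate_class_files` is a hypothetical helper, and the `ComposableSingletons$` prefix is included only because it appears in this sample.

```python
from pathlib import PurePosixPath


def candidate_class_files(source_name):
    """Return decompiled .java basenames that may come from source_name."""
    path = PurePosixPath(source_name)
    stem = path.stem
    # A class named after the file, for both Java and Kotlin sources.
    candidates = [f"{stem}.java"]
    if path.suffix == ".kt":
        # Facade class for top-level Kotlin declarations.
        facade = stem[:1].upper() + stem[1:] + "Kt"
        candidates.append(f"{facade}.java")
        # Compose-generated holder observed in the sample output.
        candidates.append(f"ComposableSingletons${facade}.java")
    return candidates
```

This does not explain `empty_kotlin.java` or `test_kotlin.java`, which presumably come from classes declared inside the .kt files under names that differ from the file name; matching those would need more than basenames, for example comparing declared class names.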

JonoYang commented 1 month ago

I have created a library named android_inspector that provides a scancode.io pipeline for Android APK d2d. One of its steps calls jadx on .dex files.

pombredanne commented 1 month ago

@chinyeungli I am creating a new issue for the Kotlin-specific parts, which were not part of the original scope; the work in this issue is working and done. See the follow-up: