Update note: This script works best on Linux. The binaries it uses internally, such as JADX, are ELF executables, and the OS-specific commands target *nix systems. It should also be fairly easy to run on Windows, though you may need to tweak roughly 10-15% of the code. If you get it working on Windows, I would really appreciate it if you shared your changes with everyone here; feel free to create a PR ^_^. I plan to do this myself, since many people work in Windows environments and are running into problems, but only when I get time or a break from my current academics. :-)
This script extracts permission information from the Malware and Benign applications in their respective folders and then creates a single Comma Separated Values (.csv) file that stores everything in one place, ready to be fed into ML algorithms.
Just copy the Malware and Benign applications you want to train your ML model on into their respective folders, then run the script with the following command in a terminal.
python3 ExtractorAIO.py
The script will do the rest.
This can take several minutes depending on the size and number of your APK files.
The generated data is in .csv format and can be parsed with many prebuilt libraries or modules.
The pandas module in Python is suggested.
The data is formatted in the following way:
This is a sample dataset of 6 applications (3 Malware and 3 Benign).
With thousands of samples, the table can be too big for general Office tools to open.
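When the file is too large for a spreadsheet tool, a quick summary can still be produced by streaming it row by row. The following is a minimal sketch using only Python's standard library; the helper name `summarize_dataset`, the permission column names, and the 0/1 label encoding in the sample are assumptions for illustration. It only relies on the CLASS label being the last column.

```python
import csv
import io
from collections import Counter


def summarize_dataset(handle):
    """Stream a CSV file object and return (row_count, column_count, label_counts)."""
    reader = csv.reader(handle)
    header = next(reader)          # first line holds the column names
    label_counts = Counter()
    rows = 0
    for row in reader:
        if not row:                # skip blank lines
            continue
        rows += 1
        label_counts[row[-1]] += 1  # assumption: the label is the last column
    return rows, len(header), label_counts


# Tiny in-memory stand-in for the generated file (hypothetical values).
sample = io.StringIO(
    "NAME,android.permission.INTERNET,android.permission.SEND_SMS,CLASS\n"
    "app_a.apk,1,1,1\n"
    "app_b.apk,1,0,0\n"
    "app_c.apk,0,0,0\n"
)
rows, columns, label_counts = summarize_dataset(sample)
print(rows, columns, dict(label_counts))
```

For the real dataset, the same helper could be called as `summarize_dataset(open("data.csv", newline=""))`, which never holds more than one row in memory.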
The following is an example of importing the generated dataset for use with your sklearn RandomForest model.
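import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the generated dataset.
file = pd.read_csv("data.csv")
# Column names without the file's first column.
columnNames = file.columns[1:]
# Features: skip the first remaining column and the last one (the label).
FeatureNames = list(columnNames[1:-1])
LabelName = columnNames[-1]
X = np.asarray(file[FeatureNames])
Y = np.asarray(file[LabelName])
feature_vectors = X
labels = Y
# Hold out 20% of the samples for testing.
train_x, test_x, train_y, test_y = train_test_split(feature_vectors, labels, test_size=0.2)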
The code above drops the NAME column, then stores the FEATURE_MATRIX (the columns after NAME up to the second-to-last column) and the LABEL_VECTOR (the CLASS column) in X and Y respectively, which can then be split into the desired training and testing sets.
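As a possible next step, the split can be passed to a sklearn RandomForest classifier. The sketch below is illustrative only: it uses a small synthetic 0/1 feature matrix in place of the real dataset, and the toy labelling rule, `n_estimators=100`, and the fixed random seeds are assumptions, not settings taken from the PACE project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the permission matrix: 200 apps, 10 binary features.
rng = np.random.default_rng(0)
feature_vectors = rng.integers(0, 2, size=(200, 10))
# Toy rule (assumption): an app is labelled 1 when its first two features are both set.
labels = feature_vectors[:, 0] & feature_vectors[:, 1]

# Same 80/20 split as in the example above, with a fixed seed for repeatability.
train_x, test_x, train_y, test_y = train_test_split(
    feature_vectors, labels, test_size=0.2, random_state=0
)

# Fit the forest on the training portion and score it on the held-out portion.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_x, train_y)
predictions = model.predict(test_x)
accuracy = accuracy_score(test_y, predictions)
print(f"held-out accuracy: {accuracy:.3f}")
```

With the real data, `feature_vectors` and `labels` would be the X and Y arrays built from data.csv instead of the synthetic arrays.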
This is used in the PACE project.
It can be used to reproduce the work in:
A. Kumar, V. Agarwal, S. K. Shandilya, A. Shalaginov, S. Upadhyay and B. Yadav, "PACE: Platform for Android Malware Classification and Performance Evaluation," 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4280-4288, doi: 10.1109/BigData47090.2019.9006557.
@INPROCEEDINGS{9006557,
  author={A. {Kumar} and V. {Agarwal} and S. K. {Shandilya} and A. {Shalaginov} and S. {Upadhyay} and B. {Yadav}},
  booktitle={2019 IEEE International Conference on Big Data (Big Data)},
  title={PACE: Platform for Android Malware Classification and Performance Evaluation},
  year={2019},
  volume={},
  number={},
  pages={4280-4288},
}
=== Extra Reading ===