! NOT IN ACTIVE DEVELOPMENT !

Update Note: This script works best on Linux. The binaries used for internal processing, such as JADX, are ELF, and the OS-specific commands are for *nix systems. It should also be fairly easy to run on Windows, though you might need to tweak ~10-15% of the code. If you do so and get it working on Windows, I would really appreciate it if you could share it with everyone here; feel free to create a PR ^_^. I am planning to do this myself, as many people are on Windows and are facing problems, but I will do it when I get time or a vacation from my current academics. :-)

Android Permission Extraction and Dataset Creation with Python

About:

This script extracts permission information from Malware and Benign applications in their respective folders and then creates a single Comma Separated Values (.csv) file that stores everything in one place, ready to be fed into ML algorithms.
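As a rough illustration of the extraction step, the sketch below pulls the requested permissions out of an already decoded AndroidManifest.xml using only the Python standard library. It is not the code used by ExtractorAIO.py: the function name `permissions_from_manifest` and the sample manifest are hypothetical, and it assumes the manifest has already been decoded to plain XML (for example by JADX).

```python
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def permissions_from_manifest(manifest_xml):
    """Return the set of permission names requested in a decoded manifest.

    Hypothetical helper for illustration; not taken from ExtractorAIO.py.
    """
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return {
        elem.attrib[name_attr]
        for elem in root.iter("uses-permission")
        if name_attr in elem.attrib
    }


# Hypothetical manifest text standing in for decoded APK output.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.CALL_PHONE"/>
  <uses-permission android:name="android.permission.READ_PHONE_STATE"/>
  <application android:label="Demo"/>
</manifest>"""

if __name__ == "__main__":
    for permission in sorted(permissions_from_manifest(SAMPLE_MANIFEST)):
        print(permission)
```

Returning a set keeps each permission once per application, which is all the 0/1 dataset columns need.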

How to use?

Copy the Malware and Benign applications on which you want to train your ML model into their respective folders, then run the script with the following command in a terminal.

python3 ExtractorAIO.py

The script will do the rest.

This can take several minutes depending on the size and number of your APK files.

How to use generated data?

The generated data is in .csv format and can be parsed with many prebuilt libraries or modules.

The pandas module in Python is suggested.
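A minimal sketch of loading the dataset with pandas might look like the following. To stay self-contained it reads a small in-memory CSV (`SAMPLE_CSV`, made up for this example) instead of the real output file, and the helper name `load_dataset` is an assumption; in practice you would pass the path of the generated .csv file.

```python
import io

import pandas as pd

# Small in-memory stand-in for the generated .csv file (illustrative only).
SAMPLE_CSV = """NAME,android.permission.ACCESS_NETWORK_STATE,android.permission.CALL_PHONE,CLASS
a.SurlyProjectFinal.apk,1,1,0
aero.zztrop.apk,0,0,0
ackman.placemarks.apk,1,0,1
"""


def load_dataset(source):
    """Read the dataset, using the NAME column as the row index."""
    return pd.read_csv(source, index_col="NAME")


df = load_dataset(io.StringIO(SAMPLE_CSV))

# Number of samples per class (0 = Benign, 1 = Malware).
class_counts = df["CLASS"].value_counts()

# How many applications request each permission.
permission_usage = df.drop(columns="CLASS").sum()

if __name__ == "__main__":
    print(df.shape)
    print(class_counts)
    print(permission_usage)
```

Using NAME as the index keeps the application names available for lookups while leaving only numeric columns in the frame.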

Formatting

The data is formatted in the following way:

NAME	android.permission.ACCESS_NETWORK_STATE	android.permission.CALL_PHONE	android.permission.READ_PHONE_STATE	android.permission.WRITE_EXTERNAL_STORAGE	CLASS
a.SurlyProjectFinal.apk	1	1	1	1	0
ae.gov.dha.dha.apk	1	0	1	1	0
aero.zztrop.apk	0	0	0	0	0
a5starapps.com.drkalamquotes.apk	1	0	0	0	1
ackman.placemarks.apk	1	0	0	1	1
ackmaniac.currencyfxrates.apk	1	0	0	0	1

This is a sample dataset of 6 applications (3 Malware & 3 Benign).

With 1000s of samples, the table can be too big for general Office tools to open.

The 1st column contains the name of each application, and the last column, "CLASS", indicates whether the application comes from the benign or malware family of the training set [0 = Benign, 1 = Malware]. In between are all the permissions (common + all found in 1st phase), each with an information bit [0 = the application does not use this permission, 1 = the application uses this permission].
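The layout described above can be sketched with the standard csv module. This is only an illustration of the format, not the script's actual logic: the helper names `build_table` and `to_csv_text` are hypothetical, and the columns here are just the union of permissions seen in the input, whereas the real script also includes a list of common permissions.

```python
import csv
import io


def build_table(apps):
    """Build (header, rows) from a list of (name, permissions, label) tuples.

    Hypothetical helper: columns are the sorted union of all permissions.
    """
    columns = sorted(set().union(*(perms for _, perms, _ in apps)))
    header = ["NAME"] + columns + ["CLASS"]
    rows = [
        [name] + [1 if col in perms else 0 for col in columns] + [label]
        for name, perms, label in apps
    ]
    return header, rows


def to_csv_text(header, rows):
    """Serialize the table as comma separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


P = "android.permission."
# Three rows taken from the sample table above (label 0 = Benign, 1 = Malware).
APPS = [
    ("a.SurlyProjectFinal.apk",
     {P + "ACCESS_NETWORK_STATE", P + "CALL_PHONE",
      P + "READ_PHONE_STATE", P + "WRITE_EXTERNAL_STORAGE"}, 0),
    ("aero.zztrop.apk", set(), 0),
    ("ackman.placemarks.apk",
     {P + "ACCESS_NETWORK_STATE", P + "WRITE_EXTERNAL_STORAGE"}, 1),
]

header, rows = build_table(APPS)

if __name__ == "__main__":
    print(to_csv_text(header, rows))
```

Sorting the permission names gives a stable column order, so repeated runs over the same applications produce identical files.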

Importing Data Example in scikit-learn

The following example imports the generated dataset and prepares it for a scikit-learn RandomForest model.

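import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the generated dataset; the first column is NAME, the last is CLASS.
file = pd.read_csv("data.csv")
# Drop the NAME column and keep the remaining column labels.
columnNames = file.columns[1:]
FeatureNames = list(columnNames[:-1])  # every permission column
LabelName = columnNames[-1]            # the CLASS column
X = np.asarray(file[FeatureNames])     # feature matrix
Y = np.asarray(file[LabelName])        # label vector
# Hold out 20% of the samples for testing.
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)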

The code above drops the NAME column, then stores the feature matrix (every column after NAME up to, but not including, CLASS) in X and the label vector (the CLASS column) in Y. Both are then split into training and test sets, here with 20% of the samples held out for testing.
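Once the split is available, training a RandomForest could look roughly like the sketch below. The data here is synthetic (random 0/1 features with a made-up labelling rule) so the example runs without a dataset file; it says nothing about real malware detection accuracy, and the hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the permission matrix: 200 apps, 4 permission bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
# Made-up rule so there is something to learn: label 1 when bits 1 and 2 are set.
Y = (X[:, 1] & X[:, 2]).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y
)

# Arbitrary hyperparameters, chosen only to keep the example fast.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_x, train_y)

predictions = model.predict(test_x)
accuracy = accuracy_score(test_y, predictions)

if __name__ == "__main__":
    print("test accuracy on synthetic data:", accuracy)
```

With the real dataset, X and Y would come from the generated .csv file as shown earlier, and `model.feature_importances_` can then hint at which permissions the forest relies on most.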

This is used in the PACE project.
This can be used to reproduce the work in:
A. Kumar, V. Agarwal, S. K. Shandilya, A. Shalaginov, S. Upadhyay and B. Yadav, "PACE: Platform for Android Malware Classification and Performance Evaluation," 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4280-4288, doi: 10.1109/BigData47090.2019.9006557.
- Abstract: Android malware has become the topmost threat for ubiquitous and useful Android eco-system. Multiple solutions leveraging big data and machine learning capabilities to detect android malware are being constantly developed. Too often, many of these solutions are either limited to the research output or remain isolated and unable to reach to end-users or malware researchers. In this paper, we propose, PACE, a unified solution to offer open and easy implementation access to several machine learning-based Android malware detection techniques that make most of the research in this domain reproducible. The benefits of PACE are offered using three interfaces i.e. through REST API, Web Interface and ADB interface. Multiple interfaces enable users with different expertise such as IT administrator, security practitioners, malware researcher, etc. to avail its offered services. A community-accepted dataset is used for testing of all the techniques to provide a better comparison of performance. A prototype of the proposed platform is introduced and our vision is that it will help malware analysts to tackle challenges and reduce the amount of manual work.
- Keywords: Android (operating system); Big Data; invasive software; learning (artificial intelligence); pattern classification; software performance evaluation; big data; malware analysts; Android malware classification; performance evaluation; PACE; machine learning-based Android malware detection; Malware; Androids; Humanoid robots; Feature extraction; Smart phones; Machine learning; Security; Android Malware; Reproducible Research; Machine Learning; Cyber Threat Intelligence
- URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9006557&isnumber=9005444

Please cite the above paper if you are using this tool:

@INPROCEEDINGS{9006557,
  author={A. {Kumar} and V. {Agarwal} and S. K. {Shandilya} and A. {Shalaginov} and S. {Upadhyay} and B. {Yadav}},
  booktitle={2019 IEEE International Conference on Big Data (Big Data)},
  title={PACE: Platform for Android Malware Classification and Performance Evaluation},
  year={2019},
  volume={},
  number={},
  pages={4280-4288},
}

=== Extra Reading ===

Kumar, Ajit, K. S. Kuppusamy, and G. Aghila. "FAMOUS: Forensic Analysis of MObile devices Using Scoring of application permissions." Future Generation Computer Systems 83 (2018): 158-172.
