Saket-Upadhyay / Android-Permission-Extraction-and-Dataset-Creation-with-Python

One script to create a permission-based dataset of android applications for your next ML Malware Detection gizmo.
MIT License

! NOT IN ACTIVE DEVELOPMENT !

Update note: This script works best on Linux. The binaries it uses internally, such as JADX, are ELF executables, and the OS-specific commands target *nix systems. It should still be fairly easy to run on Windows, though you might need to tweak about 10-15% of the code. If you do so and get it working on Windows, I would really appreciate it if you shared your changes with everyone here; feel free to create a PR ^_^. I plan to do this myself, since many people work in Windows environments and are running into problems, but only when I get time or a vacation from my current academics. :-)

Android Permission Extraction and Dataset Creation with Python

About:

This script extracts permission information from the Malware and Benign applications in their respective folders, then writes everything to a single Comma Separated Values (.csv) file, ready to be fed into ML algorithms.
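As a rough illustration of what "permission information" means here, the sketch below reads the `uses-permission` entries from a manifest that has already been decoded to plain-text XML. It is not the script's own extraction code: the helper name `permissions_from_manifest` and the sample manifest are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree uses for android:name attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def permissions_from_manifest(manifest_xml):
    """Return the set of permission names requested in a decoded manifest.

    Hypothetical helper for illustration; assumes plain-text XML input.
    """
    root = ET.fromstring(manifest_xml)
    return {
        elem.attrib[ANDROID_NS + "name"]
        for elem in root.iter("uses-permission")
        if ANDROID_NS + "name" in elem.attrib
    }


# Made-up manifest used only to exercise the helper.
sample_manifest = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.CALL_PHONE"/>
    <uses-permission android:name="android.permission.READ_PHONE_STATE"/>
    <application android:label="Demo"/>
</manifest>
""".strip()

found = permissions_from_manifest(sample_manifest)
print(sorted(found))
```

Each name found this way would correspond to one permission column in the generated dataset.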

How to use?

Copy the Malware and Benign applications you want to train your ML model on into their respective folders, then run the script with the following command in a terminal.

python3 ExtractorAIO.py

The script will do the rest.

This can take several minutes depending on the size and number of your APK files.
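The labelling step implied above (one folder of benign samples, one of malware) could look roughly like the sketch below. The function name `collect_labelled_apks` and the folder names are assumptions for this example; check the script itself for the folder names it actually expects.

```python
import tempfile
from pathlib import Path


def collect_labelled_apks(benign_dir, malware_dir):
    """Return (path, label) pairs: 0 for benign, 1 for malware.

    Hypothetical helper; the folder arguments are placeholders.
    """
    samples = []
    for folder, label in ((Path(benign_dir), 0), (Path(malware_dir), 1)):
        for apk in sorted(folder.glob("*.apk")):
            samples.append((apk, label))
    return samples


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    benign = Path(tmp, "benign")
    malware = Path(tmp, "malware")
    benign.mkdir()
    malware.mkdir()
    (benign / "good_app.apk").touch()
    (malware / "bad_app.apk").touch()
    labelled = [
        (path.name, label)
        for path, label in collect_labelled_apks(benign, malware)
    ]

print(labelled)
```

The label assigned here is what ends up in the CLASS column of the dataset.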

How to use generated data?

The generated data is in .csv format and can be parsed with many prebuilt libraries or modules.

The pandas module for Python is suggested.
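A minimal sketch of loading such a file with pandas, assuming the layout described under Formatting (a NAME column, one column per permission, and a CLASS column). The in-memory sample and its app names are made up so the snippet runs on its own; with a real file, pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for the generated file.
sample_csv = io.StringIO(
    "NAME,android.permission.CALL_PHONE,android.permission.READ_PHONE_STATE,CLASS\n"
    "app_one.apk,0,1,0\n"
    "app_two.apk,1,1,1\n"
    "app_three.apk,1,0,1\n"
)

# Use the application name as the row index so only numeric columns remain.
df = pd.read_csv(sample_csv, index_col="NAME")

# Count how many apps in each class request each permission.
usage_by_class = df.groupby("CLASS").sum()

print(df.shape)
print(usage_by_class)
```

Grouping by CLASS like this is a quick way to see which permissions are requested more often by one family than the other.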

Formatting

The data is formatted as follows:

NAME,android.permission.ACCESS_KEYGUARD_SECURE_STORAGE,android.permission.ACCESS_NETWORK_STATE,android.permission.CALL_PHONE,android.permission.READ_PHONE_STATE,android.permission.WRITE_EXTERNAL_STORAGE,CLASS
a.SurlyProjectFinal.apk,0,1,1,1,1,0
ae.gov.dha.dha.apk,0,1,0,1,1,0
aero.zztrop.apk,0,0,0,0,0,0
a5starapps.com.drkalamquotes.apk,0,1,0,0,0,1
ackman.placemarks.apk,0,1,0,0,1,1
ackmaniac.currencyfxrates.apk,0,1,0,0,0,1

This is a sample dataset of 6 applications (3 Malware & 3 Benign).

With thousands of samples, the table can become too big for general office tools to open.
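When the file is too large to open in a spreadsheet, it can still be processed row by row. The sketch below tallies the CLASS column with the standard `csv` module without loading the whole table into memory. The helper name `count_classes` and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def count_classes(lines):
    """Stream rows one at a time and tally the CLASS column.

    `lines` can be any iterable of text lines, such as an open file.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["CLASS"]] += 1
    return counts


# Made-up rows standing in for a large generated file.
sample = io.StringIO(
    "NAME,android.permission.CALL_PHONE,CLASS\n"
    "app_one.apk,1,0\n"
    "app_two.apk,0,1\n"
    "app_three.apk,1,1\n"
)

counts = count_classes(sample)
print(counts)
```

With a real dataset, the same function can be called on a file opened with `open("data.csv", newline="")`.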

The first column contains the name of each application, and the last column, "CLASS", records whether the application comes from the benign or the malware family of the training set [0 = Benign, 1 = Malware]. The columns in between cover all the permissions (common + all found in the 1st phase), each with an information bit [0 = the application does not use this permission, 1 = the application uses this permission].
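The encoding described above can be sketched as follows: collect every permission seen across the apps, then write one row per app with a 0/1 bit per permission. This is an illustration under assumptions, not the script's own code. The helper `write_dataset` and the demo apps are hypothetical, and the sketch uses only the permissions found in the input, without the separate list of common permissions mentioned above.

```python
import csv
import io


def write_dataset(apps, out):
    """Write NAME, one column per permission, and CLASS.

    `apps` is a list of (name, permission_set, label) tuples.
    Hypothetical helper for illustration.
    """
    all_permissions = sorted(set().union(*(perms for _, perms, _ in apps)))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["NAME", *all_permissions, "CLASS"])
    for name, perms, label in apps:
        # 1 if the app requests the permission, 0 otherwise.
        bits = [1 if perm in perms else 0 for perm in all_permissions]
        writer.writerow([name, *bits, label])


# Made-up apps used only to show the output shape.
apps = [
    ("benign_demo.apk", {"android.permission.ACCESS_NETWORK_STATE"}, 0),
    (
        "malware_demo.apk",
        {
            "android.permission.ACCESS_NETWORK_STATE",
            "android.permission.CALL_PHONE",
        },
        1,
    ),
]

buffer = io.StringIO()
write_dataset(apps, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Sorting the permission names keeps the column order stable between runs, which matters when a trained model is later applied to a newly generated file.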

Importing Data Example in SKlearn

The following example loads the generated dataset and prepares it for your sklearn RandomForest model.

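import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the generated dataset
file = pd.read_csv("data.csv")
# Skip the NAME column; keep the permission columns and CLASS
columnNames = file.columns[1:]
FeatureNames = list(columnNames[:-1])  # every permission column
LabelName = columnNames[-1]            # the CLASS column
X = np.asarray(file[FeatureNames])
Y = np.asarray(file[LabelName])
feature_vectors = X
labels = Y
# Hold out 20% of the samples for testing
train_x, test_x, train_y, test_y = train_test_split(feature_vectors, labels, test_size=0.2)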

The above code drops the NAME column, then stores the FEATURE_MATRIX (every column after NAME up to the second-to-last column) in X and the LABEL_VECTOR (the CLASS column) in Y. These can then be split into the desired training and testing sets.
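Once the data is split, fitting a model could look like the sketch below. It uses scikit-learn's `RandomForestClassifier` on synthetic permission bits so that it runs without a dataset file. The labelling rule, the sample count, and the hyperparameters are all assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix: 200 apps, 5 permission bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
# Made-up labelling rule, only to give the model something to learn.
Y = X[:, 2] & X[:, 3]

train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_x, train_y)

predictions = model.predict(test_x)
accuracy = accuracy_score(test_y, predictions)
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

With the real dataset, `X` and `Y` would be the arrays built in the example above, and the accuracy would depend on the samples collected.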

More

Please cite the following paper if you are using this tool:

@INPROCEEDINGS{9006557,
  author={A. {Kumar} and V. {Agarwal} and S. K. {Shandilya} and A. {Shalaginov} and S. {Upadhyay} and B. {Yadav}},
  booktitle={2019 IEEE International Conference on Big Data (Big Data)},
  title={PACE: Platform for Android Malware Classification and Performance Evaluation},
  year={2019},
  pages={4280-4288},
}



=== Extra Reading ===