basedrhys / obfuscated-code2vec

Code for the paper "Embedding Java Classes with code2vec: Improvements from Variable Obfuscation" in MSR 2020
MIT License

Embedding Java Classes with code2vec: Improvements from Variable Obfuscation

Overall project view

This repository contains the Java obfuscation tool (built with Spoon) and the dataset pipeline described in:

Rhys Compton, Eibe Frank, Panos Patros, and Abigail Koay - Embedding Java Classes with code2vec: Improvements from Variable Obfuscation, in MSR '20 [ArXiv Preprint]

All models and data used in the paper are also included, for reproduction and further research.

Table of Contents

Downloadable Assets

Requirements

Usage: Obfuscator

  1. cd java-obfuscator
  2. Locate a folder of .java files (e.g., from the code2seq repository)
  3. Set the input and output directories in obfs-script.sh, as well as the number of threads for your machine. If you're running it on a particularly large folder (e.g., millions of files), you may need to increase NUM_PARTITIONS to 3 or 4; otherwise, memory issues can grind the obfuscator to a near halt.
  4. Run obfs-script.sh, i.e., $ source obfs-script.sh

This will produce a new folder of obfuscated .java files, which can be used to train a new obfuscated code2vec model (or any other model that learns from source code).
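Before running the script, the relevant settings look roughly like the following. This is an illustrative sketch only; the variable names (INPUT_DIR, OUTPUT_DIR, NUM_THREADS, NUM_PARTITIONS) are assumptions apart from NUM_PARTITIONS, which is named above, so check obfs-script.sh for the exact names it uses:

```shell
# Illustrative obfs-script.sh settings (names other than NUM_PARTITIONS
# are assumptions; consult the script for the real variable names).
INPUT_DIR="/path/to/java-large"        # folder of .java files to obfuscate
OUTPUT_DIR="/path/to/java-large-obfs"  # destination for obfuscated files
NUM_THREADS=8                          # match your machine's core count
NUM_PARTITIONS=1                       # raise to 3 or 4 for very large folders
```

Partitioning splits the input so each chunk fits in memory, at the cost of a little extra startup time per partition.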

Usage: Dataset Pipeline

Dataset Pipeline View

The pipeline uses a trained code2vec model as a feature extractor, converting a classification dataset of .java files into numerical form (.arff by default), which can then be used as input to any standard classifier.

All of the model-related code (common.py, model.py, PathContextReader.py), as well as the JavaExtractor folder, comes from the original code2vec repository. It is used to invoke the trained code2vec models and create method embeddings, i.e., to use code2vec as a feature extractor.

The dataset should be in the same form as those supplied with this paper, i.e.:

```
dataset_name
|-- class1
|   |-- file1.java
|   |-- file2.java
|   ...
|-- class2
|   |-- file251.java
|   |-- file252.java
|   ...
...
```
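A short stdlib-only script can verify that a dataset follows this layout by enumerating (class label, file path) pairs from the class-named subfolders. The directory and file names below are placeholders matching the example tree above:

```python
import os
import tempfile

# Build a toy dataset in the expected layout: one subfolder per class,
# each containing .java files.
root = tempfile.mkdtemp()
layout = {"class1": ["file1.java", "file2.java"], "class2": ["file251.java"]}
for cls, files in layout.items():
    os.makedirs(os.path.join(root, cls))
    for name in files:
        open(os.path.join(root, cls, name), "w").close()

# Enumerate (class label, file path) pairs the way a loader would see them.
samples = [(cls, os.path.join(root, cls, f))
           for cls in sorted(os.listdir(root))
           for f in sorted(os.listdir(os.path.join(root, cls)))
           if f.endswith(".java")]

print([(c, os.path.basename(p)) for c, p in samples])
# → [('class1', 'file1.java'), ('class1', 'file2.java'), ('class2', 'file251.java')]
```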

To run the dataset pipeline and create class-level embeddings for a dataset of Java files:

  1. cd pipeline
  2. pip install -r requirements.txt
  3. Download a .java dataset (one of the supplied datasets or your own) and place it in the java_files/ directory
  4. Download a code2vec model checkpoint and put the checkpoint folder in the models/ directory
  5. Change the paths and definitions in model_defs.py and number of models in scripts/create_datasets.sh to match your setup
  6. Run create_datasets.sh (source scripts/create_datasets.sh). This will loop through each model and create class-level embeddings for the supplied datasets. The resulting datasets will be in .arff format in the weka_files/ folder.

You can now perform class-level classification on the dataset using any off-the-shelf WEKA classifier. Note that the dataset contains the original filename as a string attribute for debugging purposes; you'll likely need to remove this attribute before you pass the dataset into a classifier.
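Removing the filename string attribute can be done with WEKA's Remove filter, or with a few lines of plain Python before loading the file elsewhere. The sketch below assumes a small illustrative .arff file (the relation and attribute names are made up) and strips the string attribute column; note it does not handle commas inside quoted values:

```python
# Minimal sketch: drop the string (filename) attribute from an .arff file
# before classification. Relation/attribute names here are illustrative.
arff = """@relation embeddings
@attribute filename string
@attribute emb0 numeric
@attribute emb1 numeric
@attribute class {class1,class2}
@data
'Foo.java',0.12,0.34,class1
'Bar.java',0.56,0.78,class2
"""

header, data = [], []
attr_idx, drop, in_data = 0, None, False
for line in arff.splitlines():
    low = line.lower()
    if in_data and line.strip():
        # Remove the string attribute's column from each data row.
        data.append([c for i, c in enumerate(line.split(",")) if i != drop])
    elif low.startswith("@attribute"):
        if " string" in low:
            drop = attr_idx      # remember which column to remove
        else:
            header.append(line)
        attr_idx += 1
    else:
        header.append(line)
        if low.startswith("@data"):
            in_data = True

cleaned = "\n".join(header + [",".join(row) for row in data])
print(cleaned)
```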

Config

By default, the pipeline uses the full range of values for each parameter, which creates a huge number of resulting .arff datasets (>1000). To reduce this number, remove (or comment out) some of the items in the arrays in reduction_methods.py and selection_methods.py (at the end of each file). Our experiments showed that the SelectAll selection method and the NoReduction reduction method performed best in most cases, so you may want to keep just these.
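Trimming those arrays might look like the following. The class definitions here are stand-in stubs; only the names NoReduction and SelectAll come from the paper's results, so match them against the actual classes in reduction_methods.py and selection_methods.py:

```python
# Illustrative stubs standing in for the pipeline's real classes
# (only the names NoReduction / SelectAll come from the README).
class NoReduction:
    def apply(self, X):
        return X  # leave the embedding matrix unchanged

class SelectAll:
    def apply(self, X):
        return X  # keep every method embedding

# At the end of reduction_methods.py / selection_methods.py, keeping only
# one entry per list collapses the parameter grid to a single combination
# per model/dataset pair:
reduction_methods = [NoReduction()]  # e.g., PCA variants commented out
selection_methods = [SelectAll()]    # e.g., top-k selectors commented out

print(len(reduction_methods) * len(selection_methods))  # → 1
```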

Trained code2vec Models

The models are all available for download: Zenodo Link.

The .java datasets used to train each of the models (different versions of java-large from the code2seq repository), as well as the preprocessed code2vec-ready versions of those datasets are also available: Google Drive Link

Datasets

The .java datasets collated for this research are all available for download: Zenodo Link.

For the interactive embedding visualisation links below, the best results are often obtained with UMAP.

The class distributions shown below were generated by WEKA.

OpenCV/Spring

2 categories, 305 instances

Class Distribution

Embedding Visualisation

OpenCV/Spring Visualisation

Algorithm Classification

7 categories, 182 instances

Class Distribution

Embedding Visualisation

Algorithm Classification Visualisation

Code Author Attribution

13 categories, 1062 instances

Class Distribution

Embedding Visualisation

Code Author Attribution Visualisation

Bug Detection

2 categories, 31135 instances*

Class Distribution

Duplicate File Detection

2 categories, 1669 instances

Class Distribution

Duplicate Function Detection

2 categories, 1277 instances

Class Distribution

Malware Classification

We cannot share this dataset for security reasons; however, you can request it from the original authors: http://amd.arguslab.org/

3 categories, 20927 instances*

Class Distribution

Notes

* 2,000 samples per class were randomly sampled during experiments, so the results in the paper are reported on a smaller dataset. The downloadable dataset is the full version.

Citation

Embedding Java Classes with code2vec: Improvements from Variable Obfuscation

@inproceedings{10.1145/3379597.3387445,
author = {Compton, Rhys and Frank, Eibe and Patros, Panos and Koay, Abigail},
title = {Embedding Java Classes with Code2vec: Improvements from Variable Obfuscation},
year = {2020},
isbn = {9781450375177},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3379597.3387445},
doi = {10.1145/3379597.3387445},
booktitle = {Proceedings of the 17th International Conference on Mining Software Repositories},
pages = {243–253},
numpages = {11},
keywords = {machine learning, code obfuscation, neural networks, code2vec, source code},
location = {Seoul, Republic of Korea},
series = {MSR '20}
}