Capybara-BinT5

Replication package for the SANER 2023 paper titled "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries".

For questions about the content of this repo, please use the issues board. If you have any questions about the paper, please email the first author.

HuggingFace 🤗

The models and the dataset are also available on the HF Hub.
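If you prefer pulling them from the Hub instead of Zenodo, a plain git clone works. The repository names below are placeholders rather than the actual IDs; check the AISE-TUDelft organisation page on the Hub for the exact names:

git clone https://huggingface.co/AISE-TUDelft/{modelRepo}
git clone https://huggingface.co/datasets/AISE-TUDelft/{datasetRepo}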

To replicate the experimental setup of the paper, follow these steps:

Docker Image

It is recommended to use the provided Docker image, which has the correct CUDA version and all of the required dependencies installed. Pull the image, create a container, and mount this folder as a volume:

docker pull aalkaswan/bint5
docker run -i -t --name {containerName} --gpus all -v $(pwd):/data aalkaswan/bint5 /bin/bash

This should spawn a shell, which allows you to use the container. Change to the mounted volume:

cd /data/

All of the following commands should then be run from within the Docker container. You can respawn the shell using:

docker exec -it {containerName} /bin/bash

If you wish to run without Docker, we also provide a requirements.txt file.
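In that case, a clean virtual environment is the simplest route; this is a minimal sketch assuming a local Python 3 installation with a compatible CUDA-enabled PyTorch setup:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt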

Setup

First, clone the CodeT5 repo into this directory:

git clone https://github.com/salesforce/CodeT5.git

Run the following command to set the correct working directory in the training script:

wdir=\WORKDIR=\"`pwd`/'CodeT5/CodeT5'\" && sed -i '1 s#^.*$#'$wdir'#' CodeT5/CodeT5/sh/exp_with_args.sh
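As a quick sanity check, the first line of the script should now contain the absolute WORKDIR path of your clone; you can confirm this with:

head -n 1 CodeT5/CodeT5/sh/exp_with_args.sh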

Now that the model is set up, we need to download the data. Use the following commands to download and unpack it:

wget https://zenodo.org/record/7229809/files/Capybara.zip
unzip Capybara.zip
rm Capybara.zip

Similarly, download and unpack the BinT5 model checkpoints:

wget "https://zenodo.org/records/7229913/files/BinT5.zip?download=1" -O BinT5.zip
unzip BinT5.zip
rm BinT5.zip

Finetune Models

To use this data with BinT5, set up the data folders in the CodeT5 project:

mkdir -p CodeT5/CodeT5/data/summarize/{C,decomC,demiStripped,strippedDecomC}

Now move the data of your choice from Capybara/training_data/{lan}/{dup,dedup} to CodeT5/CodeT5/data/summarize/{lan} (an illustrative copy command is shown after the training command below). In the downloaded CodeT5 repo, change this line and add the languages to the subtask list. Finally, edit the language variable in the job.sh file and start training in detached mode:

docker exec -d {containerName} /bin/bash "/data/job.sh"
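For instance, to fine-tune on the deduplicated decompiled-C split, something like the following copies the data into place (an illustrative sketch; adjust the paths to the actual layout of the unpacked Capybara archive):

cp -r Capybara/training_data/decomC/dedup/* CodeT5/CodeT5/data/summarize/decomC/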

You can view the progress and results of the fine-tuning in the CodeT5/CodeT5/sh/log.txt file; the resulting model and training outputs are also placed in the same folder.
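To follow the log live from the host while the detached job runs, something along these lines works (assuming the container name and log path used above):

docker exec -it {containerName} tail -f /data/CodeT5/CodeT5/sh/log.txt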

Use Finetuned BinT5 Checkpoints

For each of the models, a PyTorch .bin checkpoint file is provided in its respective folder. These models can be loaded into CodeT5 and used for inference or further training.

To utilise the models, first download the reference CodeT5-base model from HuggingFace:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Salesforce/codet5-base
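One possible way to combine the two is to overwrite the skipped LFS weight file in the codet5-base clone with a BinT5 checkpoint, so that the folder can be loaded as a regular pretrained model directory. This is a minimal sketch; the {modelName} folder and the checkpoint filename are assumptions, adjust them to match the contents of the unpacked BinT5.zip:

cp BinT5/{modelName}/pytorch_model.bin codet5-base/pytorch_model.bin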