AISE-TUDelft / Capybara-BinT5

Replication package for the SANER 2023 paper titled "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries"

cannot understand the README.md clearly #10

Open ljk419511 opened 1 week ago

ljk419511 commented 1 week ago
  1. Based on the following commands, I initially thought it was binding ~/Capybara-BinT5 to /data.

docker run -i -t --name {containerName} --gpus all -v $(pwd):/data aalkaswan/bint5 /bin/bash

docker exec -d {containerName} /bin/bash "/data/job.sh"

But job.sh contains cd /data/CodeT5/sh/, which I don't understand. In which directory exactly should docker run be executed to create the container?

  2. I'm very confused about where exactly the following commands should be executed: inside the container or locally? Which directory should I cd into?

First, clone the CodeT5 repo into this directory:

git clone https://github.com/salesforce/CodeT5.git

Run the following command to set the correct working directory in the training script:

wdir=\WORKDIR=\"`pwd`/'CodeT5'\" && sed '1 s#^.*$#'$wdir'#' CodeT5/sh/exp_with_args.sh

To use this data in BinT5, setup the data folders in the CodeT5 project:

mkdir -p CodeT5/data/summarize/{C,decomC,demiStripped,strippedDecomC}

I'm totally confused. Any help would be greatly appreciated.

aalkaswan commented 6 days ago

Hi,

  1. This just binds the current working directory to /data. The second command is to re-enter the container after you've left it. You should execute this command in the directory of this repo. The job.sh script needs to be executed in the container.
  2. These commands can be executed either outside the container (in your current directory) or inside the container (in /data, which is bound to your current directory). I recommend doing it outside the container, as the container doesn't have git installed.

So in the end the directory tree should look something like this: Capybara-BinT5 (bound to /data) -> (job.sh, CodeT5) -> (data) -> (summarize) -> (C,decomC,demiStripped,strippedDecomC)
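The layout and the bind mount described above can be sketched as follows. This is a minimal illustration that builds a stand-in tree in a temporary directory rather than touching a real checkout; the helper `to_container_path` is hypothetical and only mimics how `-v $(pwd):/data` maps host paths to container paths.

```shell
set -eu

# Stand-in for the cloned Capybara-BinT5 repo (hypothetical temp location).
repo="$(mktemp -d)/Capybara-BinT5"
mkdir -p "$repo"
touch "$repo/job.sh"

# Data folders as described in the reply above, created one by one so the
# sketch does not depend on brace expansion.
for lang in C decomC demiStripped strippedDecomC; do
  mkdir -p "$repo/CodeT5/data/summarize/$lang"
done

# Hypothetical helper: with `-v "$repo":/data`, a host path maps to a
# container path by swapping the repo prefix for /data.
to_container_path() {
  printf '/data%s\n' "${1#"$repo"}"
}

to_container_path "$repo/job.sh"
to_container_path "$repo/CodeT5/data/summarize/decomC"
```

The two printed paths are where the container would see job.sh and the decomC data folder, assuming docker run was started from the repo directory.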

Please let me know if you need anything else,

-Ali

ljk419511 commented 5 days ago

Thank you very much for your reply. I still have some questions.

1.

git clone https://github.com/salesforce/CodeT5.git

After executing this command above, the working directory tree looks like the following.

[Screenshot of the directory tree]

I would like to ask whether I need to cd into CodeT5 before executing the following command:

wdir=\WORKDIR=\"`pwd`/'CodeT5'\" && sed '1 s#^.*$#'$wdir'#' CodeT5/sh/exp_with_args.sh

If I don't, I get this error:

sed: can't read CodeT5/sh/exp_with_args.sh: No such file or directory

It seems to me that the command should be `sed '1 s#^.*$#'$wdir'#' CodeT5/CodeT5/sh/exp_with_args.sh`.

I just want to check whether this was an unintentional mistake.

2. Still the same command.

wdir=\WORKDIR=\"`pwd`/'CodeT5'\" && sed '1 s#^.*$#'$wdir'#' CodeT5/sh/exp_with_args.sh

To actually modify the file, it seems necessary to pass the -i option (--in-place), which tells sed to make the replacement directly in the source file. That is:

sed -i '1 s#^.*$#'$wdir'#' CodeT5/sh/exp_with_args.sh
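The difference between the two forms can be checked on a throwaway file. This is a minimal sketch assuming GNU sed (BSD/macOS sed expects `-i ''` instead); the file is a stand-in, not the real exp_with_args.sh, and the WORKDIR value is a hypothetical example.

```shell
set -eu

work="$(mktemp -d)"
script="$work/exp_with_args.sh"   # stand-in, not the real CodeT5 script
printf 'WORKDIR="/old/path"\necho "training"\n' > "$script"

wdir='WORKDIR="/data/CodeT5"'     # hypothetical value for illustration

# Without -i: the edited text goes to stdout and the file is left unchanged.
sed '1 s#^.*$#'"$wdir"'#' "$script" > /dev/null
before="$(head -n 1 "$script")"

# With -i (GNU sed): the first line of the file itself is rewritten.
sed -i '1 s#^.*$#'"$wdir"'#' "$script"
after="$(head -n 1 "$script")"

echo "without -i: $before"
echo "with -i:    $after"
```

Running `head -n 1` on the real script after the command is a quick way to confirm that the first line was actually replaced.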

3.

In the downloaded CodeT5 repo change this line and add the languages to the subtask list. Finally, edit the language variable in the job.sh file and start training in detached mode:

I'm sorry. I'm still a little confused about what changes I should make. It would be nice if you could give me a few more hints.

aalkaswan commented 1 day ago

Hi,

  1. Sorry for the delay, but I think I figured out the issue: the CodeT5 repo was updated, so the folder structure changed. I'll update the commands in the repo accordingly.

  2. Yes, you're correct, I've updated the command.

  3. So in the CodeT5/CodeT5/sh/run_exp.py file, you should change the line to include the data you just added:

sub_tasks = ['ruby', 'javascript', 'go', 'python', 'java', 'php', 'C', 'decomC', 'demiStripped', 'strippedDecomC']

Then in the run.sh file you can select the data you want to train in line 3, the one set in the script now is decomC.
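The sub_tasks edit can also be scripted instead of done by hand. The sketch below works on a stand-in file in a temporary directory; the original line it starts from is an assumption about what run_exp.py contains, so check the real file first (it may define several sub_tasks lists, which is why the pattern is narrowed to the one starting with 'ruby'). GNU sed is assumed for `-i`.

```shell
set -eu

work="$(mktemp -d)"
run_exp="$work/run_exp.py"        # stand-in, not the real CodeT5 file

# Assumed original line; verify against the real run_exp.py before editing.
printf "    sub_tasks = ['ruby', 'javascript', 'go', 'python', 'java', 'php']\n" > "$run_exp"

extra="'C', 'decomC', 'demiStripped', 'strippedDecomC'"

# On the matching line only, append the new entries before the closing bracket.
sed -i "/sub_tasks = \['ruby'/ s/]\$/, $extra]/" "$run_exp"

cat "$run_exp"
```

After the edit, the line should match the sub_tasks list quoted above.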

To run it, make sure you're out of the container again and run the following command: docker exec -it {containerName} /bin/bash

ljk419511 commented 15 hours ago

Sincerely thank you for your reply! Still some problems.

1. I seem to have caused some confusion. The command should be: wdir=\WORKDIR=\"`pwd`/'CodeT5/CodeT5'\" && sed -i '1 s#^.*$#'$wdir'#' CodeT5/CodeT5/sh/exp_with_args.sh

2. [Screenshot from 2024-07-02 16-28-45]

So the purpose of the following command is to use the data to finetune the CodeT5-base model to become BinT5 and then do some evaluation, is that the correct understanding?

docker exec -d {containerName} /bin/bash "/data/job.sh"

In that case, do I need to download the base model CodeT5-base from Hugging Face first? I don't see you doing that.

If so, I would like to know where I should put the CodeT5-base model folder, or which parameter declares the location of the model. I didn't find such a parameter.

3. I want to make sure that what is being downloaded here is the fine-tuned model, i.e. BinT5 in Fig. 6 above, right?

Similarly to download the pretrained BinT5 checkpoints:

wget https://zenodo.org/records/7229913/files/BinT5.zip?download=1
unzip BinT5.zip
rm Capybara.zip
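One practical detail with this download: wget typically names the saved file after the last URL component, query string included, so the archive may not end up as BinT5.zip unless an output name is given. The sketch below derives a clean name from the URL and only prints the commands, so it runs offline; quoting the URL and using `-O` are suggestions, not part of the original README. It also removes the archive it downloaded, on the assumption that the `rm Capybara.zip` in the quoted snippet is a leftover from the earlier download step.

```shell
set -eu

url='https://zenodo.org/records/7229913/files/BinT5.zip?download=1'

# Derive a clean archive name: drop everything up to the last slash,
# then drop the query string.
archive="${url##*/}"
archive="${archive%%\?*}"

# Print the commands instead of running them, so the sketch works offline.
echo "wget -O $archive '$url'"
echo "unzip $archive"
echo "rm $archive"
```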

So here we can use the downloaded BinT5 model for other operations such as inference or further training according to the following quote. I don't know if I understand it correctly.

Select the model that you wish to use from the respective directory. Copy this file and replace the pytorch_model.bin in the local codet5-base directory downloaded in the previous step.

Any help would be greatly appreciated!