AISE-TUDelft / Capybara-BinT5

Replication package for the SANER 2023 paper titled "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries"

Issue with Training #8

Closed wtegge2 closed 4 months ago

wtegge2 commented 4 months ago

I ran into an issue while trying to train the model. I ran the following command to start training: python3 run_exp.py --model_tag codet5_base --task summarize --sub_task C |& tee log.txt

The code throws an AssertionError at line 162 in run_exp.py (images included). Upon inspection, I see that this line calls the get_sub_tasks function in the same file, which returns the list of valid sub-tasks for the task passed in. Since I am passing in "summarize", it returns the following list: sub_tasks = ['ruby', 'javascript', 'go', 'python', 'java', 'php']
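To make the failure concrete, here is a minimal sketch of the check as I understand it. The body of get_sub_tasks and the helper check_sub_task are my own reconstruction from the behaviour described above, not a copy of run_exp.py, so the real code may differ.

```python
# Hypothetical reconstruction of the check described above;
# the actual code in run_exp.py may be structured differently.

def get_sub_tasks(task):
    """Return the sub-tasks accepted for a task (only 'summarize' is sketched)."""
    if task == 'summarize':
        return ['ruby', 'javascript', 'go', 'python', 'java', 'php']
    return []


def check_sub_task(task, sub_task):
    """Membership assertion of the kind that appears to fail (hypothetical helper)."""
    assert sub_task in get_sub_tasks(task), (
        f"{sub_task!r} is not a valid sub-task of {task!r}"
    )


if __name__ == '__main__':
    try:
        # Same arguments as the command line: --task summarize --sub_task C
        check_sub_task('summarize', 'C')
    except AssertionError as err:
        print('AssertionError:', err)
```

Under this reading, any --sub_task value outside the six listed languages triggers the assertion, which would explain why C fails.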

I am confused because the setup instructions do not mention creating training folders for these languages. The languages included in the setup are C, decomC, demiStripped, and strippedDecomC. I am unsure whether this is intentional. Could you please suggest a fix for this error and let me know whether the code is correct?

[Screenshots: assertion_error, assertion_error_in_code]

aalkaswan commented 4 months ago

Hi, thanks for reaching out.

You should create a folder for each of the languages in the setup (I've fixed the broken command in the README), then copy the files from the downloaded dataset into those folders. The fix for the other issue is simple: add the languages (C, decomC, etc.) to the sub_tasks list. I've also updated this in the documentation.
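A rough sketch of both steps follows, under some assumptions: the four names are the languages listed in the setup, BINT5_SUB_TASKS and make_language_dirs are hypothetical names used only for illustration, and base_dir stands in for whichever data directory the README specifies. The actual get_sub_tasks in run_exp.py may look different.

```python
from pathlib import Path

# Languages from the setup instructions (constant name is hypothetical).
BINT5_SUB_TASKS = ['C', 'decomC', 'demiStripped', 'strippedDecomC']


def get_sub_tasks(task):
    """Sketch of the extended sub-task list; not the exact code in run_exp.py."""
    if task == 'summarize':
        original = ['ruby', 'javascript', 'go', 'python', 'java', 'php']
        return original + BINT5_SUB_TASKS
    return []


def make_language_dirs(base_dir):
    """Create one folder per language under base_dir (hypothetical helper).

    base_dir is a placeholder; use the data directory named in the README.
    """
    created = []
    for lang in BINT5_SUB_TASKS:
        path = Path(base_dir) / lang
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

After the folders exist, the dataset files for each language still need to be copied into them by hand, as described above.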

Let me know if you need anything else 🙂