albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
MIT License

Dataset #2

Closed: henrycharlesworth closed 4 months ago

henrycharlesworth commented 6 months ago

Hi,

The paper mentions that the dataset is released, but unless I'm being really stupid, I can't find it anywhere in the repo. Are you planning to release the training dataset anytime soon?

albertan017 commented 6 months ago

Thank you for your interest in our project! The dataset we've released is meant for evaluation only. For training data, please refer to AnghaBench, which provides 1 million compilable functions that we use to train LLM4Decompile. We plan to share the script we use to create training data from AnghaBench shortly.
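
Until that script is published, here is a rough sketch of how (assembly, source) training pairs could be built from AnghaBench files. This is not the project's actual script. It assumes that `gcc` and `objdump` are on the PATH, that each AnghaBench `.c` file compiles on its own, and that the pairs are wanted at optimization levels O0 to O3. The helper names (`compile_cmd`, `disasm_cmd`, `make_pair`) and the record layout are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Assumed set of optimization levels; adjust to whatever you want to train on.
OPT_LEVELS = ("O0", "O1", "O2", "O3")


def compile_cmd(src, obj, opt):
    """Build the gcc command that compiles one C file to an object file."""
    return ["gcc", f"-{opt}", "-c", str(src), "-o", str(obj)]


def disasm_cmd(obj):
    """Build the objdump command that disassembles an object file."""
    return ["objdump", "-d", str(obj)]


def make_pair(src: Path, out_dir: Path, opt: str) -> Optional[dict]:
    """Return one hypothetical training record, or None if a tool fails."""
    obj = out_dir / f"{src.stem}_{opt}.o"
    try:
        subprocess.run(compile_cmd(src, obj, opt), check=True, capture_output=True)
        asm = subprocess.run(
            disasm_cmd(obj), check=True, capture_output=True, text=True
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Compilation or disassembly failed, or the tool is not installed.
        return None
    return {"opt": opt, "asm": asm, "source": src.read_text()}
```

To use it, loop over the AnghaBench `.c` files and over `OPT_LEVELS`, call `make_pair` for each combination, and skip any `None` results. The raw `objdump` output would likely need cleaning (for example stripping addresses and byte encodings) before it is useful as model input.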