albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
https://arxiv.org/abs/2403.05286
MIT License
3.22k stars 236 forks

Dataset #2

Closed by henrycharlesworth 6 months ago

henrycharlesworth commented 8 months ago

Hi,

The paper mentions that the dataset is released, but unless I'm being really stupid, I can't see it anywhere. Are you planning to release the training dataset anytime soon?

albertan017 commented 8 months ago

Thank you for your interest in our project! The dataset we've provided is meant for evaluation purposes. For training data, please refer to AnghaBench, which provides 1 million compilable functions that we use to train LLM4Decompile. We plan to share the script we use to create training data from AnghaBench shortly.
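Until that script is available, here is a rough sketch of how assembly/source pairs could be built from a directory of compilable C files. This is not the authors' script: the use of `gcc` and `objdump`, the optimization levels in `OPT_LEVELS`, and the helper names `build_commands` and `make_pairs` are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path

# Assumption: these optimization levels are illustrative, not taken from this thread.
OPT_LEVELS = ("O0", "O1", "O2", "O3")


def build_commands(src: Path, out_dir: Path, opt: str):
    """Return (compile_cmd, disassemble_cmd, object_path) for one C source file."""
    obj = out_dir / f"{src.stem}_{opt}.o"
    compile_cmd = ["gcc", "-c", f"-{opt}", str(src), "-o", str(obj)]
    disasm_cmd = ["objdump", "-d", str(obj)]
    return compile_cmd, disasm_cmd, obj


def make_pairs(src: Path, out_dir: Path):
    """Compile one C file at each level and pair the disassembly with its source."""
    out_dir.mkdir(parents=True, exist_ok=True)
    source_text = src.read_text()
    pairs = []
    for opt in OPT_LEVELS:
        compile_cmd, disasm_cmd, _ = build_commands(src, out_dir, opt)
        # Skip files that fail to compile instead of aborting the whole run.
        if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
            continue
        asm = subprocess.run(
            disasm_cmd, capture_output=True, text=True, check=True
        ).stdout
        pairs.append({"opt": opt, "asm": asm, "source": source_text})
    return pairs
```

Calling `make_pairs(Path("some_function.c"), Path("build"))` would return one record per optimization level that compiled successfully, assuming `gcc` and `objdump` are on the `PATH`. The file name here is a placeholder.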