albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
https://arxiv.org/abs/2403.05286
MIT License
3.22k stars, 236 forks

I wonder if you could share some experience on collecting datasets #26

Open Pisces032 opened 2 months ago

Pisces032 commented 2 months ago

I'm trying to fine-tune it with PEFT. I have collected some datasets, but they are either too small or depend on too many headers that have to be installed first, and the install commands for different headers differ greatly. So I wonder if you have any advice on how to find suitable datasets like AnghaBench. Thank you so much!

albertan017 commented 2 months ago

We've only found AnghaBench and Exebench, which cover nearly all available C libraries. If you have specific requirements, you might need to manually compile larger projects like Linux. While it's time-consuming, this approach can be beneficial for improving the model further, and that's what we're doing now.
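
For the header problem raised above, one option is to pre-filter candidate C files before compiling and keep only those whose `#include` lines refer to headers you already have. The following is a minimal sketch under stated assumptions, not the pipeline used for LLM4Decompile: the allow-list `STANDARD_HEADERS` and the helper names `included_headers`, `is_self_contained`, and `filter_sources` are hypothetical, and the regex does not account for includes hidden inside comments or conditional compilation.

```python
import re

# Assumption: a small allow-list of C standard library headers; extend as needed.
STANDARD_HEADERS = {
    "assert.h", "ctype.h", "errno.h", "float.h", "limits.h", "math.h",
    "stdarg.h", "stdbool.h", "stddef.h", "stdint.h", "stdio.h",
    "stdlib.h", "string.h", "time.h",
}

# Matches both `#include <x.h>` and `#include "x.h"`, allowing spaces after `#`.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def included_headers(source: str) -> set[str]:
    """Return the set of header names referenced by #include directives."""
    return set(INCLUDE_RE.findall(source))


def is_self_contained(source: str, allowed: set[str] = STANDARD_HEADERS) -> bool:
    """True if every included header is in the allow-list."""
    return included_headers(source) <= allowed


def filter_sources(sources: dict[str, str]) -> list[str]:
    """Return the names of sources that only need allow-listed headers, sorted."""
    return sorted(name for name, text in sources.items() if is_self_contained(text))
```

Files that pass this filter still need an actual compile check (for example with `gcc -c`) before they go into a dataset; the filter only reduces the number of files that fail because of missing third-party headers.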