albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
MIT License

Do you take `struct` into consideration? #8

Open XinyuShe opened 6 months ago

XinyuShe commented 6 months ago

Do you take `struct` into consideration? And how do you handle excessively long functions in assembly code?

albertan017 commented 6 months ago

No, currently we only consider a single function.

Gathering data and developing a workable approach for decompiling complex files with multiple functions and structures is quite demanding. Therefore, this initial version of LLM4Decompile is limited to decompilation of individual functions.

Addressing the complexities posed by external functions and struct definitions is a primary focus of our future decompilation efforts, and our team is actively working on strategies for them. Although the problem may be ill-posed, a larger and more varied training dataset will allow the model to make statistical guesses about the functions and types that correspond to the missing pieces. We'll report the results asap!

XinyuShe commented 6 months ago

@albertan017 Thanks for your reply! I am also wondering where you found those C file datasets without structs and long functions?

albertan017 commented 6 months ago

> @albertan017 Thanks for your reply! I am also wondering where you found those C file datasets without structs and long functions?

We removed those parts from AnghaBench for simplification. The original dataset is available here. However, the dataset is only compilable, not linkable: the sources compile, but they cannot be linked into runnable executables. Therefore, we are looking for other benchmarks and collecting our own data.
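
For readers who want to try a similar simplification, here is a minimal sketch of such a filter. It is not the authors' actual preprocessing script: the function names, the regex-based `struct` check, and the `MAX_LINES` threshold are all assumptions made for illustration, and each sample is assumed to be available as a single C source string.

```python
import re

# Hypothetical threshold: the thread does not say what counts as "long".
MAX_LINES = 50

# Matches `struct` as a whole word, so identifiers like `restructure` are not flagged.
STRUCT_RE = re.compile(r"\bstruct\b")

# Matches C comments and string/char literals so keywords inside them can be ignored.
NOISE_RE = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_comments_and_strings(source: str) -> str:
    """Replace comments and literals with a space (a rough heuristic, not a C parser)."""
    return NOISE_RE.sub(" ", source)


def keep_sample(source: str, max_lines: int = MAX_LINES) -> bool:
    """Return True if the sample uses no `struct` and is short enough to keep."""
    if STRUCT_RE.search(strip_comments_and_strings(source)):
        return False
    # Count non-blank lines of the original source as a crude length measure.
    n_lines = sum(1 for line in source.splitlines() if line.strip())
    return n_lines <= max_lines


if __name__ == "__main__":
    samples = [
        "int add(int a, int b) {\n    return a + b;\n}\n",
        "struct point { int x; };\nint getx(struct point *p) { return p->x; }\n",
    ]
    kept = [s for s in samples if keep_sample(s)]
    print(f"kept {len(kept)} of {len(samples)} samples")
```

A regex check like this is deliberately crude. It will miss structs hidden behind a `typedef` declared in another file, so a real pipeline would more likely rely on a proper C parser.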