5c4lar closed this issue 4 months ago
2024.5.10 Update: All the evaluations and models are now based on executables. Enjoy!
Thanks for your interest in our project! Indeed, we're using object files instead of executables, because our training material from AnghaBench supports only compilation without linking. At present, we're examining ExeBench and also gathering our own datasets to generate executable files for training a practical LLM4Decompile.
The process of gathering data and developing a workable approach for decompiling complex files with multiple functions is quite demanding. Therefore, this initial version of LLM4Decompile is limited to decompilation of individual functions.
Handling external functions and type definitions is a primary focus of our future decompilation work, and our team is actively working on strategies for it. While the problem may be ill-posed, a larger and more varied training dataset will allow the model to make statistical guesses about the functions and types that correspond to the missing pieces. We'll report the results as soon as we can!
We also recommend trying these projects:
They're fascinating and very powerful!
Upon thorough examination, it has come to my attention that the dataset's integrity might be compromised due to the methodology employed in generating the assembly representations. Specifically, the use of object files instead of fully linked binaries introduces inaccuracies, particularly concerning external function calls and the handling of immediate values.
Because the linking step is skipped, the displacement of each external function call is left as a zero placeholder in the disassembly, which produces misleading representations. Each call to an external function is disassembled as a call to the next instruction, which can severely impair the model's ability to distinguish between different external function calls.
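To make the effect concrete, here is a minimal Python sketch of how the target of an x86-64 `call rel32` instruction (opcode `0xE8`) is computed. The helper name `call_target`, the addresses, and the "linked" displacement are assumptions chosen for illustration; they are not taken from the dataset.

```python
import struct

def call_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Return the destination of a 5-byte x86-64 `call rel32` instruction."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte call rel32 instruction (opcode 0xE8)")
    # The displacement is a signed 32-bit little-endian value.
    (rel32,) = struct.unpack("<i", insn_bytes[1:])
    # The displacement is relative to the address of the NEXT instruction.
    next_insn = insn_addr + 5
    return next_insn + rel32

# In an unlinked object file the displacement is a zero placeholder.
unlinked_call = bytes.fromhex("e800000000")
# Hypothetical bytes after linking: the displacement has been filled in.
linked_call = bytes.fromhex("e834120000")

print(hex(call_target(0x10, unlinked_call)))  # target is the next instruction
print(hex(call_target(0x10, linked_call)))    # target is somewhere else
```

With a zero displacement, every external call resolves to the address immediately after itself, so calls to different functions become indistinguishable from the bytes alone.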
For example, in your `decompile-eval.json` (line 294, task 10, O1), the function that calls strlen, malloc and strncpy is given to the model as a disassembly in which the `callq` instructions do not point to the correct locations. Even state-of-the-art decompilers cannot decompile such assembly (when the object files are stripped and the correct values are not filled into those calls).
This discrepancy raises concerns about the reliability and effectiveness of language models trained on such data. Inaccurate representations could undermine the model's ability to generalize and produce meaningful decompiled C functions.
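One way to recover the missing information, assuming the object file's relocation entries are still available, is to match each call site against the relocation that covers its displacement field. The sketch below is illustrative only: `label_calls` is a hypothetical helper, and the offsets are invented rather than taken from task 10. It relies on the fact that, for a `call rel32`, the relocated field starts one byte after the opcode.

```python
def label_calls(call_sites, relocations):
    """Map each call instruction offset to the symbol its relocation names.

    call_sites:  offsets of `call rel32` instructions within the function.
    relocations: dict from relocation offset to symbol name.
    """
    labeled = {}
    for addr in call_sites:
        # The rel32 field begins one byte after the 0xE8 opcode.
        symbol = relocations.get(addr + 1)
        labeled[addr] = symbol if symbol is not None else "<unresolved>"
    return labeled

# Hypothetical relocation table and call offsets for illustration.
relocs = {0x0E: "strlen", 0x1A: "malloc", 0x2B: "strncpy"}
calls = [0x0D, 0x19, 0x2A, 0x40]

for addr, name in label_calls(calls, relocs).items():
    print(f"{addr:#x}: callq {name}")
```

If the relocation entries are absent, every call falls into the `<unresolved>` case, which is the situation described above.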