This repository contains the code, the dataset and additional technical information for our USENIX Security '22 paper:
Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, Davide Balzarotti. How Machine Learning Is Solving the Binary Function Similarity Problem. USENIX Security '22.
The paper is available at this link.
The technical report, with additional information on the dataset and the selected approaches, is available at this link.
The repository is structured in the following way:
The following is a list of the main steps to follow based on the most common use cases:
Reproduce the experiments presented in the paper
Test a new approach on our datasets
Use one of the existing approaches to infer new functions
Please use the following BibTeX:
@inproceedings {280046,
author = {Andrea Marcelli and Mariano Graziano and Xabier Ugarte-Pedrero and Yanick Fratantonio and Mohamad Mansouri and Davide Balzarotti},
title = {How Machine Learning Is Solving the Binary Function Similarity Problem},
booktitle = {31st USENIX Security Symposium (USENIX Security 22)},
year = {2022},
isbn = {978-1-939133-31-1},
address = {Boston, MA},
pages = {2099--2116},
url = {https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli},
publisher = {USENIX Association},
month = aug,
}
Our corrections to the published paper:
The code in this repository is licensed under the MIT License; however, some models and scripts depend on or pull in code that is released under different licenses.
Binaries/LICENSES directory.
The Asm2vec and Doc2vec models are implemented on top of the Gensim project, which is released under LGPL-2.1.
The IDA_codeCMR.py plugin is released under GPL v3.
The catalog1 folder contains the source code of the Catalog1 library, which is licensed under GPL v3.
The FunctionSimSearch model pulls the code from the FunctionSimSearch project, which is released under Apache License 2.0.
The SAFE model contains part of the original source code of SAFE, which is licensed under GPL v3.
The GGSNN and GMN models contain part of the original source code, which is licensed under Apache License 2.0.
The GNN-s2v models contain part of the original source code, which is licensed under CC BY-NC-SA 4.0.
For help or issues, please submit a GitHub issue.