arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Continue literary review #28

Open redouane-dziri opened 4 years ago

redouane-dziri commented 4 years ago

We should all keep reading about what other people are doing on similar problems and link articles here, along with any fresh ideas.

Hoping to get Yorgos' Deep Learning references sometime soon to get cracking on that front if it rocks anyone's boat :)

arthurherbout commented 4 years ago

I have read papers on co-clustering, a field that clusters unlabeled data points and the features describing them at the same time. Text documents are a good example: each document is composed of words, and the idea is to group documents together WITH the words that characterize them. If we view the problem as a bipartite graph (documents on one side, words on the other), co-clustering becomes a partitioning of that graph with a minimum cut.
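To make the bipartite view concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers: the toy matrix `COUNTS` and the helper names `normalized_cut` and `best_co_clustering` are made up for illustration, and the search is a brute force that only works on tiny inputs. It also assumes a normalized cut as the objective, since a plain minimum cut is trivially zero when everything is put in one cluster.

```python
from itertools import product

# Hypothetical toy document-word count matrix: rows are documents, columns are words.
COUNTS = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 1, 2],
]


def normalized_cut(counts, doc_labels, word_labels):
    """Normalized cut of a 2-way co-clustering of the bipartite document-word graph.

    Every nonzero count is a weighted edge between a document and a word.
    An edge is cut when its document and its word carry different labels.
    """
    cut = 0
    volume = [0, 0]  # sum of vertex degrees inside each cluster
    for i, row in enumerate(counts):
        for j, weight in enumerate(row):
            volume[doc_labels[i]] += weight
            volume[word_labels[j]] += weight
            if doc_labels[i] != word_labels[j]:
                cut += weight
    if 0 in volume:
        return float("inf")  # one cluster is empty, not a real partition
    return cut / volume[0] + cut / volume[1]


def best_co_clustering(counts):
    """Try every labeling of documents and words and keep the smallest normalized cut."""
    n_docs, n_words = len(counts), len(counts[0])
    best = None
    for labels in product((0, 1), repeat=n_docs + n_words):
        docs, words = labels[:n_docs], labels[n_docs:]
        score = normalized_cut(counts, docs, words)
        if best is None or score < best[0]:
            best = (score, docs, words)
    return best


score, docs, words = best_co_clustering(COUNTS)
print("document labels:", docs)
print("word labels:", words)
print("normalized cut:", score)
```

On this toy matrix the best partition groups documents 0 and 1 with words 0 and 1, and documents 2 and 3 with words 2 and 3; the only cut edge is the single occurrence of word 1 in document 3. Real co-clustering methods replace the exhaustive search with a relaxation, typically a spectral one.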

Here are the papers I have read:

I have implemented the first two, but that will go in another issue.

The last paper mentioned is really interesting since it creates a new graph with exactly k connected components that will be our k clusters. It is a very beautiful article.
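The "connected components as clusters" idea can be illustrated with a short sketch. This is not the paper's learning procedure; it only shows the final read-out step, assuming a bipartite graph that already has k connected components. The function name `co_clusters_from_components` and the example edge list are hypothetical.

```python
def co_clusters_from_components(n_docs, n_words, edges):
    """Label documents and words by the connected component they belong to.

    `edges` is a list of (document_index, word_index) pairs.
    Returns (document_labels, word_labels, number_of_components).
    """
    # Union-find over all vertices; word j is stored at index n_docs + j.
    parent = list(range(n_docs + n_words))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for doc, word in edges:
        root_a, root_b = find(doc), find(n_docs + word)
        if root_a != root_b:
            parent[root_b] = root_a

    # Map each component root to a compact label 0..k-1.
    roots = {}
    labels = [roots.setdefault(find(v), len(roots)) for v in range(n_docs + n_words)]
    return labels[:n_docs], labels[n_docs:], len(roots)


# Hypothetical graph with two components: {doc 0, doc 1, word 0, word 1}
# and {doc 2, doc 3, word 2, word 3}.
example_edges = [(0, 0), (0, 1), (1, 1), (2, 2), (3, 2), (3, 3)]
doc_labels, word_labels, k = co_clusters_from_components(4, 4, example_edges)
print(doc_labels, word_labels, k)
```

Because the labels come straight from the graph structure, no separate k-means step is needed once the graph has exactly k components.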

The first two give a very good introduction to the linear algebra of graphs, especially bipartite ones.

Hadrien-Cornier commented 4 years ago

Probably a must-read for everyone on the team:

A LITERATURE STUDY OF EMBEDDINGS ON SOURCE CODE

https://arxiv.org/pdf/1904.03061.pdf

Will comment later with my thoughts

Hadrien-Cornier commented 4 years ago

GitHub implementations of code embeddings for C/C++ that stand out from the review:

  1. Using word2vec on paths from the control flow obtained by LLVM: https://github.com/defreez-ucd/func2vec-fse2018-artifact

  2. Using Abstract Symbolic Traces: https://github.com/jjhenkel/code-vectors-artifact

  3. Using RNN on Contextual Flow Graphs: https://github.com/spcl/ncc