arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Parsing C/C++ Code #36

Open redouane-dziri opened 4 years ago

redouane-dziri commented 4 years ago

Several issues touch on this topic, so it would be nice to have one centralized place to discuss it. I don't want to miss any updates or comments.

redouane-dziri commented 4 years ago

pycparser has come up many times as a way to generate ASTs from C code. The problem, raised in another issue, is that it only handles C, and a good share of our data is C++. I also read mentions of AST Extracter gcc, but it has the same limitation: it works only for C.
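To make the C-only limitation concrete, here is a minimal sketch, assuming pycparser is installed. The helper name `parses_as_c` and both snippets are made up for illustration and are not taken from our dataset.

```python
from pycparser import c_parser


def parses_as_c(source: str) -> bool:
    """Return True if pycparser can build an AST from `source`."""
    parser = c_parser.CParser()
    try:
        parser.parse(source, filename="<snippet>")
        return True
    except Exception:  # pycparser raises a parse error on input it cannot handle
        return False


# Hypothetical snippets: plain C versus a C++-only construct.
c_snippet = "int add(int a, int b) { return a + b; }"
cpp_snippet = "class Counter { public: int n; };"

print("C snippet parses:  ", parses_as_c(c_snippet))
print("C++ snippet parses:", parses_as_c(cpp_snippet))
```

The C++ snippet should be rejected because `class` is not part of the C grammar that pycparser implements.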

I'm not willing yet to take on the technical debt of trying different parsers for different types of code. I'd rather explore a one-size-fits-all solution first.

Clang seems like a promising candidate. Unfortunately, documentation is sparse, so I'll have to spend some time finding my way around the Python bindings, and that still leaves a significant portion to code ourselves.

Other options welcome.

corentinllorca commented 4 years ago

I think that with the new data source I'm starting to explore, we might not need to include C++ or C# at all; we could make do with just C and still have a very large dataset. The same goes for @arnaudstiegler's .h problem (unless we actually want to include .h and C++ code so that our model generalizes better).

redouane-dziri commented 4 years ago

This is more complicated than I thought. Running pycparser's parser on the string content of the ".c" files succeeds on only 54 of the 9,000+ strings. Several issues prevent it from completing. One of the biggest is that the files often depend on header files, and this dependency (see "On parsing C, type declarations and fake headers") is lost in the way we gather and process the data. The CParser expects preprocessed code in its parse method, but preprocessors like gcc, clang or cpp fail because they can't find the required headers. This is most likely not unique to pycparser.
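One way to size the header problem before choosing a parser is to list what each source string includes. Below is a minimal standard-library sketch. The function name `list_includes` and the sample file are hypothetical, and the regex is a rough approximation that also matches `#include` lines sitting inside comments.

```python
import re

# Matches lines such as `#include <stdio.h>` or `  #  include "aes.h"`.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*(<[^>\n]+>|"[^"\n]+")',
    re.MULTILINE,
)


def list_includes(source: str):
    """Split the #include targets of `source` into (system, local) lists."""
    system, local = [], []
    for match in INCLUDE_RE.finditer(source):
        target = match.group(1)
        name = target[1:-1]  # drop the surrounding <> or ""
        if target.startswith("<"):
            system.append(name)
        else:
            local.append(name)
    return system, local


# Hypothetical file content, for illustration only.
sample = '''#include <stdio.h>
#include "aes.h"
  #  include <stdint.h>

int main(void) { return 0; }
'''

system, local = list_includes(sample)
print("system headers:", system)
print("local headers: ", local)
```

The split matters because the two groups likely need different handling: standard headers can be stubbed with the fake libc headers in pycparser's repository (`utils/fake_libc_include`), while quoted project headers are the ones we dropped when collecting the data.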