Open redouane-dziri opened 4 years ago
There have been many mentions of `pycparser` for generating ASTs of C code. The issue raised elsewhere is that it only works for C, and a good portion of our data is C++. I have also seen mentions of AST Extracter gcc, but it has the same problem: it only works for C.
I'm not willing to take on the technical debt of trying different parsers for different types of code just yet. I'd like to look for a one-size-fits-all solution first.
Clang seemed like a promising candidate. Unfortunately, the documentation is sparse, so it will take some time to get familiar with the Python bindings, and even then a significant portion would have to be coded by hand.
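As a starting point for the Python bindings, here is a minimal sketch of walking a Clang AST. It assumes the `clang` Python package and a libclang shared library are installed and discoverable; `dump_ast` and `parse_with_clang` are hypothetical helper names, not part of the bindings. The walker only relies on `kind.name`, `spelling` and `get_children()`, which `clang.cindex.Cursor` provides.

```python
def dump_ast(cursor, depth=0):
    """Return one indented line per AST node, depth-first.

    Works on any object exposing `.kind.name`, `.spelling` and
    `.get_children()`, which is the shape of clang.cindex.Cursor.
    """
    lines = ["  " * depth + f"{cursor.kind.name} {cursor.spelling}".rstrip()]
    for child in cursor.get_children():
        lines.extend(dump_ast(child, depth + 1))
    return lines


def parse_with_clang(path, language="c++"):
    """Parse `path` with libclang and return the root cursor.

    Assumption: passing `-x <language>` is enough for the files at hand;
    real projects usually also need include paths and defines.
    """
    # Imported lazily: needs the `clang` package plus a libclang library.
    from clang import cindex

    index = cindex.Index.create()
    translation_unit = index.parse(path, args=["-x", language])
    return translation_unit.cursor
```

Something like `print("\n".join(dump_ast(parse_with_clang("example.cpp"))))` would then list the node kinds for a file, where `example.cpp` is a placeholder path. The same call with `language="c"` should cover the C files, which is the appeal of a single parser for both.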
Other options welcome.
I think that with the new data source I'm starting to explore, we might not even need to include C++ or C# anymore; we could do with just C and still have a very large dataset. This also goes for @arnaudstiegler's .h problem (unless we actually want to include .h and C++ code for the sake of generalization of our model).
This is more complicated than I thought. Running `pycparser`'s parser function on the string content of the ".c" files succeeds on only 54 of the 9,000+ strings.
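For reference, a tally like the one above can be produced with a loop along these lines. This is a sketch, not the exact script that was run: `count_parsable` is a hypothetical helper, `sources` is assumed to be a dict mapping file names to their text, and exceptions are caught broadly because only a success/failure count is wanted.

```python
def count_parsable(sources, parse=None):
    """Return (names that parsed, {name: error message}) for a dict of sources."""
    if parse is None:
        # Lazy import so the tally logic can be exercised without pycparser.
        from pycparser import CParser

        # CParser.parse takes the (ideally preprocessed) source as a string.
        parse = CParser().parse

    parsed, failed = [], {}
    for name, text in sources.items():
        try:
            parse(text)
            parsed.append(name)
        except Exception as exc:  # broad on purpose: we only want a tally
            failed[name] = f"{type(exc).__name__}: {exc}"
    return parsed, failed
```

Keeping the error message per file makes it easier to group failures by cause afterwards, rather than only knowing how many files failed.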
Several issues prevent it from completing. One of the biggest is that the files often depend on header files, and this dependency (see "On parsing C, type declarations and fake headers") is lost in how we gather and process the data.
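To get a feel for how many files are affected, one could scan for quoted `#include` directives whose header is not in the collected data. The sketch below uses only the standard library. `missing_includes` is a hypothetical helper, `sources` is assumed to be a dict of file name to text, and matching headers by base name is a deliberate simplification that ignores directory structure.

```python
import re
from pathlib import PurePosixPath

# Matches `#include <x.h>` and `#include "x.h"`, allowing spaces around `#`.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*([<"])([^>"]+)[>"]', re.MULTILINE
)


def missing_includes(sources):
    """Map each file to the quoted includes whose header is not in `sources`."""
    known = {PurePosixPath(name).name for name in sources}
    report = {}
    for name, text in sources.items():
        missing = sorted({
            target
            for bracket, target in INCLUDE_RE.findall(text)
            # Angle-bracket includes are system headers; skip them here.
            if bracket == '"' and PurePosixPath(target).name not in known
        })
        if missing:
            report[name] = missing
    return report
```

Files that show up in the report cannot be preprocessed as-is, whichever parser ends up being used.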
The `CParser` expects preprocessed code in its `parse` method, but preprocessors like `gcc`, `clang` or `cpp` fail because they can't find the required dependencies.
This is most likely not unique to `pycparser`.
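For the standard-library headers at least, the fake-headers approach from the post referenced above might help: `pycparser`'s repository ships stub headers under `utils/fake_libc_include`, and `parse_file` can run the preprocessor with that directory on the include path. The sketch below assumes a local checkout of that directory and `gcc` on the PATH; both helper names are hypothetical. It does not solve missing project-specific headers.

```python
def fake_header_cpp_args(fake_include_dir, extra_include_dirs=()):
    """Build preprocessor arguments that resolve standard headers to stubs."""
    # -E: preprocess only; -nostdinc: ignore the real system headers.
    args = ["-E", "-nostdinc", f"-I{fake_include_dir}"]
    args.extend(f"-I{directory}" for directory in extra_include_dirs)
    return args


def parse_with_fake_headers(path, fake_include_dir, cpp_path="gcc"):
    """Preprocess `path` against the stub headers, then parse it."""
    # Lazy import: requires pycparser and a working preprocessor.
    from pycparser import parse_file

    return parse_file(
        path,
        use_cpp=True,
        cpp_path=cpp_path,
        cpp_args=fake_header_cpp_args(fake_include_dir),
    )
```

A call such as `parse_with_fake_headers("some_file.c", "pycparser/utils/fake_libc_include")` would be the intended usage, with both paths being placeholders for wherever the file and the checkout actually live.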
Several issues touch on topics related to this. It would be nice to have a centralized place to discuss it so that no updates or comments get missed.