arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

(2) Use C tokenizer/compiler to extract code features #17

Closed arnaudstiegler closed 4 years ago

arnaudstiegler commented 4 years ago

See #2

arnaudstiegler commented 4 years ago

@arthurherbout, FYI I am experimenting with pycparser, which seems to do the job.

arnaudstiegler commented 4 years ago

The good news is that it can produce the AST for an entire file, and the AST is an absolute goldmine. However, we have to come up with ways to intelligently extract the information from the AST.

Side note: you need to remove the "docstring" at the top of the file before calling cparser.parse().
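
For reference, a minimal sketch of the pycparser usage (the add() snippet is illustrative; pycparser expects preprocessed, comment-free C, hence the side note above):

```python
from pycparser import c_parser

# Minimal sketch, assuming comment-free, preprocessed C source
# (the add() snippet is illustrative, not from our dataset)
parser = c_parser.CParser()
code = """
int add(int a, int b) {
    return a + b;
}
"""
ast = parser.parse(code)
ast.show()  # prints the full AST, node by node
```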

arnaudstiegler commented 4 years ago

Food for thought

From code2vec: Learning Distributed Representations of Code,

Following previous works [Alon et al. 2018; Raychev et al. 2015], we use paths in the program’s abstract syntax tree (AST) as our representation. By representing a code snippet using its syntactic paths, we can capture regularities that reflect common code patterns. We find that this representation significantly lowers the learning effort (compared to learning over program text), and is still scalable and general such that it can be applied to a wide range of problems and large amounts of code.

It seems that using ASTs rather than raw text is better because we capture syntactic structure (which would not be the case if we only took the textual code).
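
If we go down this road, here is a hedged sketch of how path features could be extracted from a pycparser AST (collect_paths is a hypothetical helper; root-to-leaf node-type paths are a simplification of code2vec's leaf-to-leaf paths):

```python
from pycparser import c_parser

# Sketch: enumerate root-to-leaf paths of AST node-type names,
# a crude approximation of the AST-path idea from code2vec.
def collect_paths(node, prefix=()):
    path = prefix + (type(node).__name__,)
    children = node.children()  # list of (name, child_node) pairs
    if not children:
        yield path
    for _, child in children:
        yield from collect_paths(child, path)

parser = c_parser.CParser()
ast = parser.parse("int add(int a, int b) { return a + b; }")
for p in collect_paths(ast):
    print(" -> ".join(p))
```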

arthurherbout commented 4 years ago

I am going through that very interesting paper. It has pre-trained models, but only for Java code. I am looking into whether there is an implementation for C code. If pycparser does the job, we can reuse some ideas from that paper, because all they use is the AST if I'm not mistaken. We do not need to go for an attention mechanism right now, but there must be some inspiration to draw from that paper.

I'll continue on that

arthurherbout commented 4 years ago

src-d code2vec is a good repo on the same topic

arnaudstiegler commented 4 years ago

Actually, it is not that good a fit: it runs on PySpark (which is total overkill for 10k files), and it is far from finished. I found something else that I think is better: https://github.com/vovak/astminer

arnaudstiegler commented 4 years ago

So I tried this package. It works (kind of), but it uses Gradle, so it is a black box, and it is difficult to reuse because its output is quite complex. Not sure it is feasible in the time we have.

arnaudstiegler commented 4 years ago

I found a list of interesting embeddings, but none of them are for C code, and a lot of effort would be needed to reuse them....

arnaudstiegler commented 4 years ago

Because those AST techniques are not feasible given our time constraints, I went for a very straightforward text-based approach: treat the code as plain text and apply standard NLP techniques.

The pipeline is very straightforward: tokenize each file, map tokens to integer indices, and pad the sequences to a fixed length before feeding them to the network (see the sketch below).
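
A minimal sketch of that pipeline with the Keras preprocessing utilities (code_texts is a hypothetical stand-in for the loaded source files; NUM_WORDS and MAX_LENGTH are the hyperparameters used in the model below):

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# code_texts: hypothetical list of raw C source files as strings
tokenizer = Tokenizer(num_words=NUM_WORDS, filters='')  # keep C punctuation as-is
tokenizer.fit_on_texts(code_texts)
sequences = tokenizer.texts_to_sequences(code_texts)
X = pad_sequences(sequences, maxlen=MAX_LENGTH)  # shape (n_files, MAX_LENGTH)
```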

For the neural net, the approach differs from the usual sequence pipeline: the network starts with an embedding layer (which is trained along with the rest), and instead of processing the texts as sequences, we add a Flatten layer before a single-unit Dense layer with sigmoid activation.

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(NUM_WORDS, 50, input_length=MAX_LENGTH))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model (f1_, precision_ and recall_ are custom metric
# functions, not built into Keras)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['acc', f1_, precision_, recall_])
```

The result is that code files are turned into sequences of embedded vectors that are fed directly to what is effectively a logistic regression, trained by backpropagation (along with the embedding). The upside compared with a plain logistic regression is that we also learn the embedding, which gives some nice insight into how tokens are seen by the model.
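
For instance, a hypothetical sketch of pulling the trained embedding back out for inspection (the 'xor' token is just an example):

```python
# Layer 0 is the Embedding layer from the model above.
embedding_matrix = model.layers[0].get_weights()[0]  # shape (NUM_WORDS, 50)
word_index = tokenizer.word_index  # token -> integer id from the tokenizer
vector = embedding_matrix[word_index['xor']]  # embedding of one token, if indexed
```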

arnaudstiegler commented 4 years ago

To get back to this issue: for the sake of the first report, I used a super simple architecture because it is fast, but there is a real question about the architecture. The sequences are so long that everything runs pretty slowly. It is not impossible that a feedforward net with only dense layers gives us the best results. However, if we want to go into more sequential territory, we can try more advanced architectures; one option is sketched below.
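
One possible option (my assumption, not a settled choice) is a 1D convolution over the token embeddings, which stays cheap even on long sequences:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# Sketch of a convolutional alternative to the Flatten-based model
model = Sequential()
model.add(Embedding(NUM_WORDS, 50, input_length=MAX_LENGTH))
model.add(Conv1D(64, kernel_size=5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
```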

Besides this, reusing Github/Code2Vec for learning a code encoding could be a good idea: train a seq2seq model that takes the code as input and predicts the function name or docstring. But this requires different data, since you need function definitions for that, and a much bigger dataset overall.
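
A rough sketch of what that could look like in Keras (every name and size here is an assumption, since we have neither the data nor the vocabulary yet):

```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

# Encode the function body, decode its name token by token.
# NUM_NAME_TOKENS is a hypothetical vocabulary size for function names.
enc_in = Input(shape=(MAX_LENGTH,))
enc_emb = Embedding(NUM_WORDS, 50)(enc_in)
_, state_h, state_c = LSTM(128, return_state=True)(enc_emb)

dec_in = Input(shape=(None,))
dec_emb = Embedding(NUM_NAME_TOKENS, 50)(dec_in)
dec_seq = LSTM(128, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = Dense(NUM_NAME_TOKENS, activation='softmax')(dec_seq)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```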