arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Graph embedding for semantic features? #24

Open Hadrien-Cornier opened 4 years ago

Hadrien-Cornier commented 4 years ago

I read Code2vec very closely. It is a great fit for the problem we have, and it seems clear that implementing it would improve on our current model.

However, in my opinion there is real potential to improve on current code-embedding research with relatively simple changes to the model. First, the attention model is linear; adding a little nonlinearity with a simple hidden layer could improve the results (see the sketch below).
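As an illustration, here is a minimal NumPy sketch of that idea; the dimensions, inputs, and function names are all made up, and the real code2vec scoring differs in detail:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def linear_attention(contexts, a):
    # Linear scoring: one dot product per context vector.
    weights = softmax(contexts @ a)          # (n,)
    return weights @ contexts                # weighted sum -> code vector

def hidden_layer_attention(contexts, W, a):
    # Proposed tweak: score through a small hidden layer with tanh.
    hidden = np.tanh(contexts @ W)           # (n, h) non-linear features
    weights = softmax(hidden @ a)            # (n,)
    return weights @ contexts

rng = np.random.default_rng(0)
contexts = rng.normal(size=(10, 128))        # 10 path-context vectors
code_vec = hidden_layer_attention(
    contexts, rng.normal(size=(128, 64)), rng.normal(size=64))
```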

Additionally, the path subsampling + attention mechanism looks a lot like a DeepWalk-style graph embedding, but arguably worse in some ways: it cuts long paths and does not generalize well to different function names (the authors discuss the inherent sparsity of their data in the last pages).

Also, this graph embedding technique may be computationally inefficient compared to more classical mathematical tools like spectral graph theory. Perhaps training could be much faster if, instead of an attention mechanism with many parameters, we simply ran a PCA (see the sketch below).
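A rough sketch of the PCA idea, assuming (purely hypothetically) that each file is summarized as a bag-of-AST-paths count vector; the whole embedding is then one decomposition with nothing to train:

```python
import numpy as np

# Hypothetical input: one row per code file, one column per AST-path type.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 500)).astype(float)

# PCA via SVD: a single linear-algebra operation, no learned weights.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 32
embeddings = Xc @ Vt[:k].T   # (n_files, k) parameter-free embeddings
```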

There is a nice extractor of ASTs (ASGs would theoretically be better, but I could not find a tool that extracts them; maybe someone else on the team will).

I will definitely look into this more; I think this is pretty interesting!

arnaudstiegler commented 4 years ago

@Hadrien-Cornier,

In terms of performance, I think the main bottleneck is not the algo we choose, but the fact that we have very long code files (which leads to huge ASTs). Maybe switching to function definitions could be our best option.

I don't think you can compare an attention layer and PCA. PCA is just a dimensionality-reduction technique, whereas an attention layer produces contextual embeddings and will improve your performance. You do get some non-linearity with an attention layer: the weights over the value vectors are softmax(query_vector · key_vector) (see the sketch below). I have seen a lot of architectures where people prefer to stack attention layers (basically a Transformer architecture) and average-pool the resulting vectors, rather than putting dense layers after the attention mechanism; I'm guessing that is because it uses fewer parameters without losing much performance.
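To make that concrete, here is a minimal NumPy sketch of one self-attention layer followed by average pooling; all shapes and inputs are invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each output row is a context-dependent mixture of the value vectors,
    # weighted by softmax(query . key): the non-linearity mentioned above.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))                # 10 tokens / AST contexts
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)          # (10, 64) contextual vectors
pooled = out.mean(axis=0)                    # average pooling -> (64,)
```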

For the tree extractor, the pycparser package will do the job and works fine (and it is a Python package, not a bash script); a minimal example is below. See #17: there are a few pre-implemented packages for computing the paths of an AST, but none of them are exactly what we need. So we might have to code something ourselves for that!
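A minimal sketch of what using pycparser could look like (the snippet of C is a placeholder, and pycparser expects preprocessed C, i.e. no #include); extracting the actual path contexts is the part we would still have to write:

```python
from pycparser import c_parser, c_ast

code = "int add(int a, int b) { return a + b; }"
ast = c_parser.CParser().parse(code, filename='<stdin>')

class FuncDefVisitor(c_ast.NodeVisitor):
    # Visit every function definition and list its top-level AST children.
    def visit_FuncDef(self, node):
        print('function:', node.decl.name)
        for _, child in node.body.children():
            print('  child node:', type(child).__name__)

FuncDefVisitor().visit(ast)
```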

Hadrien-Cornier commented 4 years ago

Instead of PCA, I was thinking more generally about the linear algebra used for graph representation. In this case, the eigenvector associated with the smallest non-zero eigenvalue of the tree's Laplacian matrix (the Fiedler vector) gives the direction of the optimal one-dimensional embedding of the graph. In this sense, the embedding of the graph is parameter-free, as opposed to an attention mechanism: instead of optimizing a cost function with respect to many weights and biases, one could do a non-learnable linear-algebra operation on the graph (see the sketch below), and perhaps bring the learning back to the graph itself, where we could learn an embedding of the AST's nodes.
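A small sketch of that parameter-free spectral embedding on a toy tree; the edge list stands in for a real AST, and scipy's Laplacian helper does the rest:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian

# Toy tree standing in for an AST: nodes 0..5, undirected edges.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
A = sp.lil_matrix((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = laplacian(sp.csr_matrix(A)).toarray()
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order

# eigvecs[:, 0] is the constant vector (eigenvalue ~ 0); eigvecs[:, 1] is
# the Fiedler vector, the optimal 1-D embedding. Columns 1..k give k dims.
k = 2
embedding = eigvecs[:, 1:1 + k]        # (n_nodes, k), nothing to train
```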

Also, the attention layer does not give us non-linearity if we care about a classification task rather than a regression one. If simply putting a sigmoid at the end of a neural network counted as "non-linearity" for classification, then the SVM classifier would be non-linear too. Concretely, sigmoid(w·x + b) > 0.5 exactly when w·x + b > 0, so the decision boundary is still the hyperplane w·x + b = 0: even though the output of the network is not linear in the input, the model is "linear" as a classifier.

The transformer architecture seems very interesting; perhaps we could try something similar.

Hadrien-Cornier commented 4 years ago

Seen on https://github.com/eliben/pycparser/wiki/FAQ:

> **What about parsing C++?** I have no intentions of expanding pycparser to support C++. If you need to parse C++ with Python, I suggest using Clang with its Python bindings. See this article for more details.

So I will look into Clang.
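A first sketch with Clang's Python bindings; this assumes libclang and its Python package are installed, and `sample.c` is a placeholder file:

```python
import clang.cindex

index = clang.cindex.Index.create()
tu = index.parse('sample.c')           # also handles C++ sources

def walk(cursor, depth=0):
    # Print the AST: kind and spelling identify each node.
    print('  ' * depth + cursor.kind.name, cursor.spelling)
    for child in cursor.get_children():
        walk(child, depth + 1)

walk(tu.cursor)
```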