crestonbunch / tbcnn

Efficient tree-based convolutional neural networks in TensorFlow
MIT License

A question on vectorizer using word2vec #5

Open bdqnghi opened 7 years ago

bdqnghi commented 7 years ago

Hey, thanks for this awesome implementation. This is exactly what I'm looking for, since the details of the paper are not trivial to understand.

In the vectorizer part, you adopt the word2vec technique to train the embeddings for the AST nodes, which is great. But I don't understand the intuition behind this. Is there any reference?

In word2vec, the embedding matrix serves as a look-up table and the input is a one-hot encoded vector: multiplying the one-hot input by the embedding matrix effectively just selects the matrix row corresponding to the "1" in the input.
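To make concrete what I mean by the look-up behaviour, here is a small numpy sketch (all names and sizes are my own, not from this repo):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
embedding = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

# One-hot vector selecting token index 2.
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# Multiplying the one-hot row vector by the embedding matrix
# just picks out row 2 of the matrix.
assert np.allclose(one_hot @ embedding, embedding[2])
```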

But in this case it seems different. After learning the embeddings, you save them along with NODE_MAP (the dictionary that stores the index of each token in your implementation) into the pickle. How can we know that the index of a vector in the embedding table will match its index in NODE_MAP?
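In other words, this is the correspondence I am asking about (NODE_MAP is the name from the repo; the shapes and values below are purely illustrative):

```python
import numpy as np

# Hypothetical contents, just to state the question precisely:
# NODE_MAP maps an AST node type to an integer index, and the
# embedding matrix is expected to hold that token's trained
# vector at exactly that row index.
NODE_MAP = {"Module": 0, "FunctionDef": 1, "Assign": 2}
embeddings = np.random.rand(len(NODE_MAP), 30)

# Is this row guaranteed to be the vector trained for "Assign"?
vector_for_assign = embeddings[NODE_MAP["Assign"]]
```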

lolongcovas commented 6 years ago

Yes, the original paper uses the approach from Building Program Vector Representations for Deep Learning to embed each AST node into a feature vector. This approach is quite similar to word2vec, except that the contextual information is the node's children in the case of an AST. The source code of that implementation is found here. Looking at the code (it is a bit hard to understand...), it seems that for each AST they build a new neural network (NN) with the same parameters W and b (for example, the NNs for ASTs of 2 and 3 levels differ in their forward pass, but they share the same W and b).
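If it helps, this is roughly the "coding criterion" from that paper as I read it (my own numpy sketch, not code from this repo): each parent vector is reconstructed from its children, the network's shape changes per tree, but the weight matrices and bias are shared across all nodes and all trees.

```python
import numpy as np

def parent_from_children(children, leaf_counts, W_l, W_r, b):
    """Approximate a parent node's vector from its children.
    W_l, W_r and b are the *same* for every node of every tree;
    only the per-child coefficients depend on the tree's shape."""
    n = len(children)
    total_leaves = sum(leaf_counts)
    out = np.zeros_like(b)
    for i, (child, leaves) in enumerate(zip(children, leaf_counts)):
        # Position coefficient: interpolate between the "left" and
        # "right" weight matrices according to the child's position.
        eta_r = 0.5 if n == 1 else i / (n - 1)
        W_i = (1 - eta_r) * W_l + eta_r * W_r
        # Children covering more leaves contribute more.
        out += (leaves / total_leaves) * (W_i @ child)
    return np.tanh(out + b)

dim = 30
W_l = np.random.randn(dim, dim)
W_r = np.random.randn(dim, dim)
b = np.random.randn(dim)
children = [np.random.randn(dim) for _ in range(3)]
parent = parent_from_children(children, [1, 2, 1], W_l, W_r, b)
```

Training then pushes the learned node embeddings so that each parent's vector is close to this reconstruction from its children, which is why the procedure is described as a word2vec-like embedding step for AST nodes.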