crestonbunch / tbcnn

Efficient tree-based convolutional neural networks in TensorFlow
MIT License

A question on vectorizer using word2vec #5

Open bdqnghi opened 7 years ago

bdqnghi commented 7 years ago

Hey, thanks for this awesome implementation. This is exactly what I'm looking for, since the details of the paper are not trivial to understand.

In the vectorizer part, you adopt the word2vec technique to train the embeddings for the AST nodes, which is great. But I don't understand the intuition behind this. Is there any reference?

In word2vec, the embedding matrix serves as a look-up table and the input is a one-hot encoded vector: multiplying the one-hot input by the embedding matrix effectively just selects the matrix row corresponding to the "1" in the input.
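To make concrete what I mean by the look-up behaviour, here is a small numpy sketch (all names and sizes are my own, not from this repo):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
embedding = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

# One-hot vector selecting token index 2.
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# Multiplying the one-hot row vector by the embedding matrix
# just picks out row 2 of the matrix.
assert np.allclose(one_hot @ embedding, embedding[2])
```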

But in this case it seems different. After learning the embeddings, you save them along with NODE_MAP (the dictionary that stores the index of each token in your implementation) into the pickle. How can we know that the index of a vector in the embedding table will match its index in NODE_MAP?
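In other words, this is the correspondence I am asking about (NODE_MAP is the name from the repo; the shapes and values below are purely illustrative):

```python
import numpy as np

# Hypothetical contents, just to state the question precisely:
# NODE_MAP maps an AST node type to an integer index, and the
# embedding matrix is expected to hold that token's trained
# vector at exactly that row index.
NODE_MAP = {"Module": 0, "FunctionDef": 1, "Assign": 2}
embeddings = np.random.rand(len(NODE_MAP), 30)

# Is this row guaranteed to be the vector trained for "Assign"?
vector_for_assign = embeddings[NODE_MAP["Assign"]]
```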

lolongcovas commented 6 years ago

Yes, the original paper uses the approach from Building Program Vector Representations for Deep Learning to embed each AST node into a feature vector. This approach is quite similar to word2vec, except that the contextual information is the node's children in the case of an AST. The source code of that implementation is found here. Looking at the code (it is a bit hard to understand...), it seems that for each AST they build a new neural network (NN) with the same parameters W and b (for example, the NNs for ASTs of 2 and 3 levels differ in their forward pass, but they share the same W and b).
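If it helps, this is roughly the "coding criterion" from that paper as I read it (my own numpy sketch, not code from this repo): each parent vector is reconstructed from its children, the network's shape changes per tree, but the weight matrices and bias are shared across all nodes and all trees.

```python
import numpy as np

def parent_from_children(children, leaf_counts, W_l, W_r, b):
    """Approximate a parent node's vector from its children.
    W_l, W_r and b are the *same* for every node of every tree;
    only the per-child coefficients depend on the tree's shape."""
    n = len(children)
    total_leaves = sum(leaf_counts)
    out = np.zeros_like(b)
    for i, (child, leaves) in enumerate(zip(children, leaf_counts)):
        # Position coefficient: interpolate between the "left" and
        # "right" weight matrices according to the child's position.
        eta_r = 0.5 if n == 1 else i / (n - 1)
        W_i = (1 - eta_r) * W_l + eta_r * W_r
        # Children covering more leaves contribute more.
        out += (leaves / total_leaves) * (W_i @ child)
    return np.tanh(out + b)

dim = 30
W_l = np.random.randn(dim, dim)
W_r = np.random.randn(dim, dim)
b = np.random.randn(dim)
children = [np.random.randn(dim) for _ in range(3)]
parent = parent_from_children(children, [1, 2, 1], W_l, W_r, b)
```

Training then pushes the learned node embeddings so that each parent's vector is close to this reconstruction from its children, which is why the procedure is described as a word2vec-like embedding step for AST nodes.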