Open KagamiBaka opened 3 years ago
I have the same question, but I presume you should tokenize the code into token sequences and then use those sequences to train the Word2Vec model.
Each line becomes a 'doc' in the 'corpus'. If you use an AST from Joern, you can benefit from its parsing.
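For what it's worth, here is a rough sketch of the "one line = one doc" idea, in case it helps. It is not taken from this repo: the regex tokenizer and the helper names (`tokenize_line`, `build_corpus`, `train_word2vec`) are my own assumptions, and the training step assumes gensim 4.x is installed (older gensim versions use `size` instead of `vector_size`). A Joern-based pipeline would replace the regex with tokens taken from the parsed AST.

```python
import re
from typing import List

# Hypothetical tokenizer (an assumption, not the repo's): matches identifiers,
# numbers, or any single non-whitespace symbol. Multi-character operators
# such as "==" are split into separate one-character tokens.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize_line(line: str) -> List[str]:
    """Split one line of source code into a list of tokens."""
    return TOKEN_RE.findall(line)


def build_corpus(source: str) -> List[List[str]]:
    """Treat every non-empty line as one 'doc' (a list of tokens)."""
    corpus = []
    for line in source.splitlines():
        tokens = tokenize_line(line)
        if tokens:  # skip blank lines
            corpus.append(tokens)
    return corpus


def train_word2vec(corpus: List[List[str]], vector_size: int = 100):
    """Train a Word2Vec model on the tokenized corpus (assumes gensim 4.x)."""
    from gensim.models import Word2Vec  # imported lazily; third-party dependency

    return Word2Vec(
        sentences=corpus,
        vector_size=vector_size,
        window=5,
        min_count=1,  # keep rare tokens; small code corpora have many of them
        workers=1,
    )


example = "int x = foo(a, 42);\n\nreturn x;"
corpus = build_corpus(example)
# model = train_word2vec(corpus)  # uncomment if gensim is installed
```

For a real dataset you would run `build_corpus` over every source file and pass the combined list to `train_word2vec`.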
If I want to train a new Word2Vec model for a new dataset, which files should I use, and in what order?