Summary
In this paper, the authors try to bridge the lexical gap between bug reports and source code using word embeddings. They trained a shared word embedding space that covers both NL words and code tokens, learned from API docs, tutorials, and reference docs. One interesting step in their tokenization is that after splitting a code token such as CodeToken into code and token, they also keep CodeToken itself in the vocabulary. After converting a document into its word vectors, they compute the similarity between two documents with formulas that aggregate word-level embedding similarities, roughly as sketched below.
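A minimal sketch of this style of embedding-based similarity (the function names are mine, and the paper's exact aggregation, e.g. whether idf weighting is applied, may differ): each word of one text is matched to its closest word in the other text by cosine similarity, the per-word scores are averaged, and the two asymmetric directions are combined.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def word_to_doc_sim(word, doc_tokens, vectors):
    """Similarity of one word to a document: its best cosine match
    against any word of the document."""
    if word not in vectors:
        return 0.0
    sims = [cosine(vectors[word], vectors[t]) for t in doc_tokens if t in vectors]
    return max(sims) if sims else 0.0

def asym_doc_sim(query_tokens, doc_tokens, vectors):
    """Asymmetric query -> document similarity: average of the
    per-word best-match scores."""
    sims = [word_to_doc_sim(w, doc_tokens, vectors) for w in query_tokens]
    return sum(sims) / len(sims) if sims else 0.0

def doc_sim(tokens_a, tokens_b, vectors):
    """Symmetric document similarity: mean of both directions."""
    return 0.5 * (asym_doc_sim(tokens_a, tokens_b, vectors)
                  + asym_doc_sim(tokens_b, tokens_a, vectors))
```

For retrieval, the asymmetric query-to-document direction alone can also be used as the ranking score.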
Finally, they empirically showed that incorporating word embeddings can significantly improve the performance of the baseline models. An interesting takeaway is that word embeddings alone do not perform better than classic IR; however, a combination of WE and IR similarities outperforms raw IR.
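As an illustration only, the simplest way to combine the two signals is a weighted interpolation of the IR score with the embedding-based similarity; the function and the weight alpha below are hypothetical and not necessarily the paper's exact combination scheme.

```python
def combined_score(ir_score, we_score, alpha=0.8):
    """Hypothetical linear interpolation of a classic IR score
    (e.g., a TF-IDF/VSM cosine) with the embedding-based similarity.
    alpha is a tuning weight, not a value reported in the paper."""
    return alpha * ir_score + (1.0 - alpha) * we_score
```

The two scores would typically be normalized to comparable ranges before mixing.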
Contributions of The Paper
Adapting skip-gram to learn a shared vector space for code tokens and NL words (a training sketch follows this list)
A method for computing similarities between queries and documents based on word embeddings
Extensive empirical experiments evaluating the utility of the proposed technique
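A minimal training sketch, assuming gensim's Word2Vec implementation of skip-gram; the corpus path and tokenizer details are illustrative rather than taken from the paper. It shows the two key ideas: one shared vocabulary over NL words and code tokens, and keeping the compound identifier alongside its split parts.

```python
import re
from gensim.models import Word2Vec

CAMEL = re.compile(r".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)")

def tokenize(text):
    """Lowercase NL words; split camel-case code tokens into their parts
    while also keeping the original compound token in the vocabulary."""
    tokens = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        parts = [p.lower() for p in CAMEL.findall(tok)]
        if len(parts) > 1:
            tokens.append(tok)   # keep e.g. "CodeToken"
        tokens.extend(parts)     # plus "code", "token"
    return tokens

# One token list per line of the training corpus (API docs, tutorials,
# reference docs); the file path is hypothetical.
sentences = [tokenize(line) for line in open("corpus.txt", encoding="utf-8")]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5,
                 sg=1, workers=4)   # sg=1 selects the skip-gram model
vectors = model.wv                  # shared NL/code word vectors
```

With the trained model, NL words and code identifiers live in the same vector space, so the document-similarity sketch above can compare bug-report words directly with source-code tokens.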
Publisher
Proceedings of the International Conference on Software Engineering (ICSE 2016)
Link to The Paper
https://doi.org/10.1145/2884781.2884862
Name of The Authors
Ye, Xin; Shen, Hui; Ma, Xiao; Bunescu, Razvan; Liu, Chang
Comments