Summary
In this paper, the authors try to bridge the lexical gap between bug reports and source code using word embeddings. They trained a shared word embedding space that covers both NL words and code tokens, learned from API docs, tutorials, and reference docs. One interesting step in their tokenization is that after splitting a code token such as CodeToken into code and token, they also keep CodeToken itself in the vocabulary. After converting a document into its word vectors, they compute the similarity between two documents with formulas that aggregate word-level embedding similarities, roughly as sketched below.
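A minimal sketch of this style of embedding-based similarity (the function names are mine, and the paper's exact aggregation, e.g. whether idf weighting is applied, may differ): each word of one text is matched to its closest word in the other text by cosine similarity, the per-word scores are averaged, and the two asymmetric directions are combined.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def word_to_doc_sim(word, doc_tokens, vectors):
    """Similarity of one word to a document: its best cosine match
    against any word of the document."""
    if word not in vectors:
        return 0.0
    sims = [cosine(vectors[word], vectors[t]) for t in doc_tokens if t in vectors]
    return max(sims) if sims else 0.0

def asym_doc_sim(query_tokens, doc_tokens, vectors):
    """Asymmetric query -> document similarity: average of the
    per-word best-match scores."""
    sims = [word_to_doc_sim(w, doc_tokens, vectors) for w in query_tokens]
    return sum(sims) / len(sims) if sims else 0.0

def doc_sim(tokens_a, tokens_b, vectors):
    """Symmetric document similarity: mean of both directions."""
    return 0.5 * (asym_doc_sim(tokens_a, tokens_b, vectors)
                  + asym_doc_sim(tokens_b, tokens_a, vectors))
```

For retrieval, the asymmetric query-to-document direction alone can also be used as the ranking score.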
Finally, they empirically showed that incorporating word embeddings can significantly improve the performance of the baseline models. An interesting takeaway is that word embeddings alone do not perform better than classic IR; however, a combination of WE and IR similarities outperforms raw IR.
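As an illustration only, the simplest way to combine the two signals is a weighted interpolation of the IR score with the embedding-based similarity; the function and the weight alpha below are hypothetical and not necessarily the paper's exact combination scheme.

```python
def combined_score(ir_score, we_score, alpha=0.8):
    """Hypothetical linear interpolation of a classic IR score
    (e.g., a TF-IDF/VSM cosine) with the embedding-based similarity.
    alpha is a tuning weight, not a value reported in the paper."""
    return alpha * ir_score + (1.0 - alpha) * we_score
```

The two scores would typically be normalized to comparable ranges before mixing.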
Contributions of The Paper
Adapting skip-gram to learn a shared vector space for code tokens and NL words (a training sketch follows this list)
A method for computing similarities between queries and documents based on word embeddings
Extensive empirical experiments evaluating the utility of the proposed technique
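A minimal training sketch, assuming gensim's Word2Vec implementation of skip-gram; the corpus path and tokenizer details are illustrative rather than taken from the paper. It shows the two key ideas: one shared vocabulary over NL words and code tokens, and keeping the compound identifier alongside its split parts.

```python
import re
from gensim.models import Word2Vec

CAMEL = re.compile(r".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)")

def tokenize(text):
    """Lowercase NL words; split camel-case code tokens into their parts
    while also keeping the original compound token in the vocabulary."""
    tokens = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        parts = [p.lower() for p in CAMEL.findall(tok)]
        if len(parts) > 1:
            tokens.append(tok)   # keep e.g. "CodeToken"
        tokens.extend(parts)     # plus "code", "token"
    return tokens

# One token list per line of the training corpus (API docs, tutorials,
# reference docs); the file path is hypothetical.
sentences = [tokenize(line) for line in open("corpus.txt", encoding="utf-8")]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5,
                 sg=1, workers=4)   # sg=1 selects the skip-gram model
vectors = model.wv                  # shared NL/code word vectors
```

With the trained model, NL words and code identifiers live in the same vector space, so the document-similarity sketch above can compare bug-report words directly with source-code tokens.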
Publisher
Proceedings of the International Conference on Software Engineering (ICSE 2016)
Link to The Paper
https://doi.org/10.1145/2884781.2884862
Name of The Authors
Ye, Xin; Shen, Hui; Ma, Xiao; Bunescu, Razvan; Liu, Chang
Comments