RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: From word embeddings to document similarities for improved information retrieval in software engineering #30

Open parvezmrobin opened 2 years ago

parvezmrobin commented 2 years ago

Publisher

Proceedings - International Conference on Software Engineering

Link to The Paper

https://doi.org/10.1145/2884781.2884862

Name of The Authors

Ye, Xin; Shen, Hui; Ma, Xiao; Bunescu, Razvan; Liu, Chang

Summary

In this paper, the authors try to bridge the lexical gap between bug reports and source code using word embeddings. They train a shared word embedding that covers both natural-language (NL) words and code tokens. The embedding is trained on API documents, tutorials, and reference documents. One interesting step in their tokenization: after splitting a compound code token such as CodeToken into code and token, they also keep the original CodeToken in the vocabulary. After converting a document into its corresponding embedding vectors, they compute the similarity between two documents using three formulas (shown as images in the original post).
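The tokenization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_compound` and `tokenize` are hypothetical, the camel-case regular expression is an assumption about how compound identifiers are split, and lowercasing the parts while keeping the original token in its original case is also an assumption.

```python
import re
from typing import List

# Assumed splitting rule: break identifiers at camel-case boundaries,
# keeping runs of capitals (e.g. "XML") and digit runs together.
_PART = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_compound(token: str) -> List[str]:
    """Split a compound code token into lowercase parts."""
    return [part.lower() for part in _PART.findall(token)]


def tokenize(token: str) -> List[str]:
    """Return the parts of a code token and, if it was compound,
    the original token as well, so both end up in the vocabulary."""
    parts = split_compound(token)
    if len(parts) > 1:
        return parts + [token]  # keep the unsplit token too
    return parts


print(tokenize("CodeToken"))
```

Keeping the unsplit token means the embedding can learn a vector for the identifier itself in addition to vectors for its component words.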

Finally, they show empirically that incorporating word embeddings can significantly improve the performance of baseline models. An interesting takeaway is that word embeddings (WE) alone do not perform better than information retrieval (IR); however, a combination of WE and IR can outperform IR alone.

Contributions of The Paper

  1. Adapting the skip-gram model to generate a shared vector space for code tokens and NL words
  2. A method for computing the similarity between queries and documents based on word embeddings
  3. An extensive empirical evaluation of the utility of the proposed technique
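To make the second contribution more concrete, here is a rough sketch of one common way to turn word embeddings into a document-to-document similarity. The paper's exact formulas are not reproduced in this review, so everything here is an assumption: each word is matched to its most similar word (by cosine) in the other document, the per-word scores are averaged without any term weighting, and the two directions are summed. The names `cosine`, `word_to_doc`, `directed_similarity`, and `symmetric_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Embedding = Dict[str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_to_doc(word: str, doc: List[str], emb: Embedding) -> float:
    """Similarity of one word to a document: its best match in that document."""
    if word not in emb:
        return 0.0
    sims = [cosine(emb[word], emb[w]) for w in doc if w in emb]
    return max(sims, default=0.0)


def directed_similarity(src: List[str], tgt: List[str], emb: Embedding) -> float:
    """Average best-match similarity of the words in src against tgt.
    Assumption: unweighted mean over words that have an embedding."""
    known = [w for w in src if w in emb]
    if not known:
        return 0.0
    return sum(word_to_doc(w, tgt, emb) for w in known) / len(known)


def symmetric_similarity(a: List[str], b: List[str], emb: Embedding) -> float:
    """Sum of the two directed similarities."""
    return directed_similarity(a, b, emb) + directed_similarity(b, a, emb)


# Toy two-dimensional embedding, for illustration only.
toy = {"bug": [1.0, 0.0], "error": [1.0, 0.0], "menu": [0.0, 1.0]}
print(symmetric_similarity(["bug"], ["error", "menu"], toy))
```

In a bug localization setting, `a` would be the tokens of a bug report and `b` the tokens of a source file; a score like this could then be used alongside an IR score when ranking files.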

Comments

  1. Well written and easy to read