Name of The Authors
Xiao, Yan and Keung, Jacky and Bennin, Kwabena E. and Mi, Qing
Summary
This paper proposes a new CNN-based technique for bug localization. First, the authors extract embeddings from bug reports and source files using word2vec and sent2vec, splitting identifiers at camelCase boundaries to mitigate the vocabulary-mismatch problem. They then extract local features of the bug report and the source files, and concatenate these features vertically to produce a two-row input for a so-called enhanced CNN. This enhanced CNN takes each source file's fix recency and fix frequency into account while classifying. To keep training feasible, only the 300 most dissimilar source files (by cosine similarity) are used as negative samples. As usual, they end up outperforming the baselines.
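The negative-sampling step described above (keeping only the 300 source files least similar to the bug report by cosine similarity) could look roughly like the following sketch. The function name and arguments are hypothetical, not taken from the paper:

```python
import numpy as np

def pick_negative_samples(report_vec, file_vecs, k=300):
    """Select the k source files least similar (by cosine) to the bug report.

    report_vec: (d,) embedding of the bug report
    file_vecs:  (n, d) embeddings of the candidate source files
    Returns the indices of the k most dissimilar files.
    """
    report = report_vec / np.linalg.norm(report_vec)
    files = file_vecs / np.linalg.norm(file_vecs, axis=1, keepdims=True)
    sims = files @ report            # cosine similarity of each file to the report
    return np.argsort(sims)[:k]     # indices of the k lowest-similarity files
```

With unit-normalized rows, the dot product equals cosine similarity, so sorting ascending yields the hardest negatives first.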
Contributions of The Paper
They used word2vec for title embeddings and sent2vec for description embeddings, which is interesting: each sentence of the description yields a single embedding, which keeps the computation feasible.
All code tokens are split into granular words at camelCase boundaries (I think this is not a new thing).
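A minimal sketch of the camelCase splitting mentioned above; the function name and regular expressions are my own, not the paper's:

```python
import re

def split_camel_case(identifier):
    """Split a camelCase identifier into lowercase sub-tokens."""
    # Insert a space between a lowercase letter/digit and a following uppercase letter.
    parts = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', identifier)
    # Insert a space between an acronym run and a following capitalized word.
    parts = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', parts)
    return [p.lower() for p in parts.split()]
```

For example, `split_camel_case("parseHTTPResponse")` yields `["parse", "http", "response"]`, so report terms and code identifiers can match on shared sub-words.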
Integrated source files' fix recency and fix frequency into the CNN's loss function.
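The exact way recency and fix frequency enter the loss is specified in the paper; purely as a hypothetical illustration of folding such metadata into a file's relevance score, one could write:

```python
import math

def adjusted_relevance(cnn_score, days_since_last_fix, fix_count,
                       alpha=1.0, beta=1.0):
    """Hypothetical sketch (not the paper's formula): boost a file's CNN
    relevance score using its fix recency and frequency; alpha and beta
    are made-up weights."""
    recency = 1.0 / (1.0 + alpha * days_since_last_fix)  # recently fixed -> higher
    frequency = math.log(1.0 + beta * fix_count)         # frequently fixed -> higher
    return cnn_score * (1.0 + recency + frequency)
```

The intuition is the standard one from bug-localization heuristics: files that were fixed recently or are fixed often are more likely to be buggy again.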
Comments
They claim that if two points are close to each other in a t-SNE projection, then they are similar. However, t-SNE is a non-linear projection that preserves local neighborhoods rather than global distances, so this statement does not hold in general.
Publisher
Information and Software Technology
Link to The Paper
https://doi.org/10.1016/j.infsof.2018.08.002