This paper applies BERT to changeset-based bug localization, aiming to improve retrieval quality, especially for bug reports where straightforward textual similarity does not suffice. The authors describe an information retrieval (IR) architecture that leverages BERT without compromising retrieval speed or response time. They also examine several design decisions that matter when applying BERT-like models to bug localization, including how best to encode changesets and their unique structure. They compare the accuracy and performance of their model against a non-contextual baseline (a vector space model) and against BERT-based architectures previously used in software engineering. The evaluation shows advantages for the proposed BERT model over the baselines, especially for bug reports that lack any hints about related code elements.
Contributions of The Paper
The main contributions of this paper are:
An approach that applies BERT to bug localization (specifically, localizing bug-inducing changesets) and is more accurate than the state of the art.
An improvement over other recent BERT-based architectures proposed for changeset retrieval, with significant advantages in retrieval speed.
An evaluation of, and recommendations for, key design choices in applying BERT to changesets (i.e., code change encoding and data granularity).
The BERT-based technique proposed in this paper enables semantic retrieval of software artifacts (specifically, changesets) for bug localization. It goes beyond, and can complement, the exact term matching used by currently popular state-of-the-art techniques. Relative to a similar, recent BERT-based technique, the authors offer an approach that significantly improves retrieval speed, in a way that supports real-world use, while also enhancing retrieval quality.
The approach uses the popular BERT model to match the semantics of the bug report text to the inducing changeset more accurately. More specifically, the authors describe the FBL-BERT model, based on prior work by Khattab et al., which speeds up retrieval while still performing fine-grained matching across all embeddings in the two documents. The results show improved retrieval accuracy for bug reports that lack localization hints or contain only partial hints.
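To make "fine-grained matching across all embeddings in the two documents" more concrete, the sketch below shows one common way to score such a match: for every bug report token embedding, take its best cosine similarity against all changeset token embeddings, then sum those maxima. This is only an illustration of the general idea, not the authors' implementation. The function names (`maxsim_score`, `rank_changesets`) and the 2-dimensional toy vectors are assumptions; in practice the embeddings would come from a BERT encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs: Sequence[Vector], doc_embs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    if not q or not d:
        return 0.0
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_changesets(
    query_embs: Sequence[Vector], candidates: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Return (changeset id, score) pairs, best match first."""
    scored = [(cid, maxsim_score(query_embs, embs)) for cid, embs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one row per token embedding.
bug_report = [[1.0, 0.0], [0.0, 1.0]]
changesets = {
    "c1": [[1.0, 0.0], [0.0, 1.0]],  # covers both bug report tokens
    "c2": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first token
}

for changeset_id, score in rank_changesets(bug_report, changesets):
    print(changeset_id, round(score, 3))
```

Because changeset embeddings do not depend on the query, they can be computed once offline and only the bug report needs encoding at query time, which is one plausible reason this style of matching helps retrieval speed.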
FBL-BERT relies on higher-level associations between bug reports and bug-introducing changesets, which can give exact matches less emphasis. Interestingly, the largest improvement in retrieval accuracy is observed for partially localized bug reports (BRPL), indicating that the model can effectively retrieve changesets from partial clues by associating them with patterns learned from historical data. Both the TBERT models and FBL-BERT perform better when trained and evaluated on hunks or changeset files.
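The three data granularities mentioned above (whole changeset, changeset file, hunk) can be illustrated with a small sketch that splits a git-style unified diff into retrieval units. This is a simplified illustration and not the paper's preprocessing: the helper names `split_changeset` and `retrieval_units` are hypothetical, and the parser assumes well-formed `diff --git` and `@@` header lines.

```python
from typing import Dict, List


def split_changeset(diff_text: str) -> Dict[str, List[str]]:
    """Map each changed file path to the list of its hunks (as text)."""
    files: Dict[str, List[List[str]]] = {}
    current_file = None
    current_hunk = None
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/<path> b/<path>
            current_file = line.split(" b/", 1)[-1]
            files[current_file] = []
            current_hunk = None
        elif line.startswith("@@") and current_file is not None:
            # A hunk header starts a new hunk within the current file.
            current_hunk = [line]
            files[current_file].append(current_hunk)
        elif current_hunk is not None:
            current_hunk.append(line)
        # Lines such as "---"/"+++" that precede the first hunk are skipped.
    return {path: ["\n".join(h) for h in hunks] for path, hunks in files.items()}


def retrieval_units(diff_text: str, granularity: str) -> List[str]:
    """Return the documents to index at the requested granularity."""
    by_file = split_changeset(diff_text)
    if granularity == "changeset":
        return [diff_text]
    if granularity == "file":
        return ["\n".join(hunks) for hunks in by_file.values()]
    if granularity == "hunk":
        return [hunk for hunks in by_file.values() for hunk in hunks]
    raise ValueError(f"unknown granularity: {granularity}")


# Hypothetical two-file changeset with three hunks in total.
SAMPLE_DIFF = "\n".join([
    "diff --git a/src/A.java b/src/A.java",
    "--- a/src/A.java",
    "+++ b/src/A.java",
    "@@ -1,2 +1,2 @@",
    "-int x = 0;",
    "+int x = 1;",
    "@@ -10,1 +10,2 @@",
    " return x;",
    "+// done",
    "diff --git a/src/B.java b/src/B.java",
    "--- a/src/B.java",
    "+++ b/src/B.java",
    "@@ -5,1 +5,1 @@",
    "-foo();",
    "+bar();",
])

for level in ("changeset", "file", "hunk"):
    print(level, len(retrieval_units(SAMPLE_DIFF, level)))
```

Finer units give the model shorter, more focused inputs, which fits the observation that training and evaluating on hunks or changeset files improves results.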
Publisher
ICSE
Link to The Paper
https://arxiv.org/pdf/2112.14169v1.pdf
Name of The Authors
Agnieszka Ciborowska, Kostadin Damevski
Year of Publication
2022