In the software industry, reporting software bugs is a crucial step. Developers spend a lot of time fixing reported bugs, so incoming reports must be checked for duplicates before assignment to keep the whole process effective and save valuable time and resources. This research [1] implements a hybrid model that combines topic modeling with pre-trained word embeddings: LDA is used in conjunction with pre-trained neural-network-based word embeddings (fastText, GloVe, Word2Vec, and a fusion of them) for feature extraction. To measure textual similarity, a unified similarity measure (a hybridization of Cosine similarity and Euclidean distance) is used to rank the topmost similar bug reports. The suggested methodology was evaluated on the Eclipse dataset of over 80,000 bug reports containing both master and duplicate reports, with only the report descriptions used to detect duplicates. With a calculation three times faster than the traditional classification model, the hybrid model obtains a recall rate of 67% for Top-N predictions.
Contributions of The Paper
LDA generates the Document-Topic matrix from a Bag of Words (BoW) representation; for every
document, the model then produces a probability distribution over topics [2]. LDA also overcomes
the Vocabulary Mismatch Problem [3]
Incorporating both clustering and classification speeds up the whole process of detecting duplicate bug
reports
Instead of using one technique for text vectorization, they have used different embeddings to validate
the findings
Instead of choosing the number of topics for LDA arbitrarily, they used a coherence score to determine
the optimum number of topics, which is 10 in this research
Extensive performance evaluation and discussion of all the experiments, as well as comparison with
other existing techniques in terms of time and performance results
Outlining the common pitfalls for developing the detection system
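The LDA contributions above (a Document-Topic matrix over BoW, with the topic count chosen by coherence rather than arbitrarily) can be sketched as follows. This is a minimal illustration, not the paper's code: it uses scikit-learn's LDA and a hand-rolled UMass-style coherence, since the paper does not specify its exact coherence measure, and the toy corpus stands in for real bug-report descriptions.

```python
from itertools import combinations
from math import log
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for bug-report descriptions (illustrative only).
docs = [
    "app crashes on startup after update",
    "crash when opening settings window",
    "startup crash after latest update",
    "button text not rendered in dark theme",
    "dark theme renders menu text incorrectly",
    "ui theme colors wrong after switching",
]

vec = CountVectorizer()
bow = vec.fit_transform(docs)          # Bag-of-Words matrix (docs x vocabulary)

def umass_coherence(lda, bow, top_n=5):
    """Average UMass-style coherence over topics; values closer to 0 are better."""
    df = (bow > 0).toarray()           # boolean document-word incidence
    score = 0.0
    for topic in lda.components_:
        top = topic.argsort()[::-1][:top_n]   # indices of the topic's top words
        for i, j in combinations(range(len(top)), 2):
            wi, wj = top[i], top[j]
            co = (df[:, wi] & df[:, wj]).sum()      # co-occurrence count
            score += log((co + 1) / df[:, wj].sum())
    return score / lda.n_components

# Scan candidate topic counts and keep the most coherent one. The paper
# lands on 10 topics; this toy corpus is far smaller, hence the small range.
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(bow)
    scores[k] = umass_coherence(lda, bow)
best_k = max(scores, key=scores.get)

# Document-Topic matrix: one probability distribution over topics per document.
doc_topic = LatentDirichletAllocation(n_components=best_k, random_state=0).fit_transform(bow)
print(best_k, doc_topic.shape)
```

Each row of `doc_topic` sums to 1, i.e. it is the per-document topic distribution the contribution describes.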
Comments
This research proposed a hybrid model leveraging both clustering and classification. The model exploits
Latent Dirichlet Allocation (LDA) for topic-based clustering, single-modal, and multi-modal text
representation, and a unified text similarity measure using Cosine and Euclidean metrics for ranking the
topmost similar bug reports.
After modeling the topics (n=10), they applied Word2Vec, fastText, GloVe, and hybrid combinations of
them for word embedding. Facebook proposed fastText [6] in 2016; it augments the extensively adopted
Word2Vec [5] approach by relying on the skip-gram model to represent each word as a bag
of character n-grams rather than feeding single words into the neural network. The second embedding
approach is GloVe [6], a count-based model, in contrast to Word2Vec, which is a predictive model; it is
based on matrix factorization of the word-context matrix. These neural natural language processing
techniques are applied individually to the top 10 clusters generated by LDA for feature extraction.
Lastly, they used multi-modality feature extraction by pairing Word2Vec, fastText, and GloVe and
averaging the concatenated vectors from the two models in each pair. This concatenation approach is
not validated by any relevant research or even on another dataset; what is the purpose of concatenating
embeddings that are already pre-trained? The literature review does not mention the limitations or
weaknesses of the different approaches, even though similar work in slightly different domains exists.
The paper may also have missed a few recommendation techniques from the literature.
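The fusion step being questioned here can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: the dictionaries of random vectors stand in for real pre-trained fastText and GloVe lookups, the dimensionality is arbitrary, and the fusion is shown as plain concatenation of averaged word vectors (the paper's "average concatenation" wording is ambiguous).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # illustrative; real pre-trained vectors have fixed published sizes

# Stand-ins for pre-trained lookups: real fastText/GloVe vectors would be
# loaded from their published files.
words = ["crash", "startup", "update"]
fasttext = {w: rng.standard_normal(DIM) for w in words}
glove = {w: rng.standard_normal(DIM) for w in words}

def doc_vector(tokens, emb):
    """Average the word vectors of a report (a common single-model baseline)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def fuse(tokens):
    """Multi-modal feature: concatenate the two single-model document vectors.
    Averaging them instead would keep the original dimensionality but assumes
    the two embedding spaces are comparable, which pre-training does not ensure."""
    return np.concatenate([doc_vector(tokens, fasttext), doc_vector(tokens, glove)])

report = ["crash", "startup", "update"]
fused = fuse(report)
print(fused.shape)  # twice the single-model dimensionality
```

The doubled dimensionality is exactly why the time cost of the fusion models, noted later in this review, grows relative to the single-model baselines.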
The evaluation uses only 200 sample duplicate bug reports, whereas related research uses larger
sample sizes. What is the validation and explanation behind choosing 200 as the sample size? This
small sample is not representative at a 95% confidence level with a 5% margin of error, given the
size of the dataset.
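The sample-size objection can be made concrete with Cochran's formula plus the finite-population correction, a standard (assumed, not the paper's) way to size a sample at a given confidence level and margin of error:

```python
from math import ceil

def required_sample(N, z=1.96, p=0.5, e=0.05):
    """Cochran's sample size with finite-population correction:
    n0 = z^2 * p * (1 - p) / e^2, then n = n0 / (1 + (n0 - 1) / N).
    z=1.96 is the 95% confidence z-score; p=0.5 is the conservative choice."""
    n0 = z * z * p * (1 - p) / (e * e)
    return ceil(n0 / (1 + (n0 - 1) / N))

# For the ~80,000-report Eclipse dataset, 95% confidence with a 5% margin
# of error calls for roughly 383 samples, nearly double the 200 used.
print(required_sample(80_000))  # → 383
```

Under these standard assumptions, 200 samples falls well short of a representative evaluation set.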
While choosing the K values for recommending the top-K most similar duplicate bug reports, they
chose 2.5k as one of the K values. No other existing research has used this K value, and it is not
feasible in real-life applications. What is the point of demonstrating results with a value that will
never be used in industry settings?
To measure the similarity between bug reports, instead of a single similarity measure, they used a
unified measure consisting of Cosine similarity and Euclidean distance, taken as an average, to
generate Top-K recommendations of duplicate bug reports. In theory, however, Cosine similarity is
the primary choice for document similarity in duplicate bug report detection because it mitigates a
drawback of the Euclidean distance [4]: two data vectors that share no attribute values may end up
at a lower distance than another pair of data vectors that do share attribute values.
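A minimal sketch of the unified measure, and of why Cosine is usually preferred, follows. The Euclidean-to-similarity normalization `1 / (1 + d)` is an assumption for illustration; the paper's exact normalization may differ.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors: scale-invariant."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_sim(a, b):
    # Map Euclidean distance into (0, 1]; assumed normalization, not the paper's.
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))

def unified_sim(a, b):
    """Average of the two measures, as the reviewed hybrid scheme describes."""
    return 0.5 * (cosine_sim(a, b) + euclidean_sim(a, b))

# Two reports with the same term profile but different lengths: cosine calls
# them identical in direction (1.0), while Euclidean distance still penalizes
# the length difference, dragging the unified score down.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
print(cosine_sim(a, b), euclidean_sim(a, b), unified_sim(a, b))
```

This is the crux of the critique: averaging in the Euclidean term reintroduces a length sensitivity that Cosine similarity was chosen to avoid.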
According to the experiment, single-modality feature extraction using GloVe was the second-best
model in the paper in terms of recall. Still, it outperforms the best-claimed multi-modality model
(the fusion of fastText and GloVe) in terms of time, so why do the authors recommend the fusion of
fastText and GloVe as the best approach? Is the trade-off in computational resources and time
worthwhile for little to no improvement in accuracy? The future work for this research is not clearly
stated in the conclusion, and threats to validity are missing from the paper as well.
Publisher
2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
Link to The Paper
https://ieeexplore.ieee.org/document/9283289
Name of The Authors
Thangarajah Akilan; Dhruvit Shah; Nishi Patel; Rinkal Mehta
Year of Publication
2020