Summary
This paper reports a replication study investigating to what extent SE-specific sentiment analysis tools mitigate the threats to conclusion validity highlighted by previous research. The authors replicate prior studies using four sentiment analysis tools: SentiStrength-SE, DEVA, Senti4SD, and SentiCR. The former two rely on sentiment lexicons that assign polarity scores at the word level, whereas the latter two are supervised tools. The supervised tools are used with the classifiers trained on their original gold standards by their respective authors, while the lexicon-based tools are used as implemented in their original studies.

They use two datasets to answer their three research questions. The first consists of 60,658 commits and 54,892 pull requests from the MSR 2014 mining challenge dataset; the second consists of 87,373 questions extracted from the Stack Overflow dump. For the first dataset, sentiment is analyzed at several levels of granularity: pull request comments and discussions, and commit comments and discussions. For their first RQ, they confirm the findings of the replicated studies, while also observing that off-the-shelf use of sentiment analysis tools can lead to different results at a finer level of granularity because of differences in the label distributions.

They then measure the agreement among the four tools using weighted Cohen's kappa (κ) and find that it varies from slight to moderate. The highest agreement is between SentiStrength-SE and DEVA, which are built on the same lexicon. They also observe that identifying sentiment in longer texts is harder than in shorter ones. Finally, two annotators manually labeled 600 documents randomly sampled from GitHub and Stack Overflow, and the agreement of each tool with this manual annotation was measured. Agreement between the tools and the gold labels is higher for short texts, and the main cause of errors is lexical cues that the tools either process incorrectly or overlook.
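As a pointer on the agreement measure: weighted Cohen's kappa discounts disagreements by their distance on the ordinal polarity scale, so a neutral-vs-positive mismatch costs less than a negative-vs-positive one. The snippet below is a minimal sketch, not the authors' pipeline; it assumes each tool's output has already been mapped to the labels negative/neutral/positive, and the example labels are invented.

```python
# Minimal sketch (assumed setup, not the paper's code): agreement between two
# sentiment analysis tools on the same documents via weighted Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-document polarity labels from two tools, mapped to a common scale.
tool_a = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
tool_b = ["positive", "neutral", "negative", "negative", "neutral", "neutral"]

# Encode labels as ordered integers so that linear weighting reflects the
# distance between polarity classes (negative < neutral < positive).
order = {"negative": 0, "neutral": 1, "positive": 2}
a = [order[x] for x in tool_a]
b = [order[x] for x in tool_b]

# weights="linear" gives partial credit to near-misses (e.g., neutral vs. positive)
# and the full penalty to negative vs. positive disagreements.
kappa = cohen_kappa_score(a, b, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```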
Contributions of The Paper
Confirm the findings of previous studies by replicating them with four existing sentiment analysis tools.
Provide useful insights into the performance of sentiment analysis tools when they are used off the shelf on new datasets.
Show evidence of the threats to conclusion validity that can arise from the choice of sentiment analysis tool.
Provide valuable direction for using existing sentiment analysis tools (i.e., retraining the tools or ensembling multiple tools via majority voting; a minimal sketch of the latter follows this list).
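As a toy illustration of the majority-voting option mentioned in the last contribution (an assumption about how such an ensemble could be wired, not code from the paper): combine the per-document labels of several tools and fall back to neutral when there is no majority.

```python
# Minimal sketch (hypothetical, not the authors' implementation): majority-voting
# ensemble over the polarity labels predicted by several tools for one document.
from collections import Counter

def majority_vote(labels, fallback="neutral"):
    """Return the label chosen by most tools; use `fallback` when there is a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback  # no clear majority among the tools
    return counts[0][0]

# Hypothetical predictions from three tools for the same document.
print(majority_vote(["negative", "negative", "neutral"]))   # -> negative
print(majority_vote(["positive", "negative", "neutral"]))   # -> neutral (tie)
```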
Comments
The study observes high agreement between each tool and the gold set for shorter documents. However, it does not investigate or provide any insight into what counts as a short document, i.e., at what size a document should be considered short rather than long.
Publisher
Empirical Software Engineering (ESE)
Link to The Paper
https://arxiv.org/abs/2010.10172
Name of The Authors
Nicole Novielli, Fabio Calefato, Filippo Lanubile
Year of Publication
2021