In this paper, the authors perform an empirical study to assess the performance of three sentiment analysis tools, Senti4SD, SentiStrengthSE, and SentiCR, which were specifically developed for identifying sentiment in the software engineering domain. They compare the performance of these tools against a baseline represented by SentiStrength. To evaluate the tools, they use four different datasets: Stack Overflow posts, Jira issue comments, Java libraries, and code review comments. They manually annotate these datasets, adopting both ad-hoc and model-driven approaches to build the gold sets. First, they investigate to what extent the SE-specific tools improve over the baseline. They observe that SentiCR performs best on the Jira dataset, possibly because of its SMOTE-based rebalancing of the label distribution in the training data; however, Senti4SD and SentiStrength achieve comparable performance. On the Stack Overflow dataset, the best performing approach is Senti4SD (F-measure micro = .87, macro = .86, κ = .83), followed by SentiCR (F-measure micro = .82, macro = .82, κ = .76) and SentiStrengthSE (F-measure micro = .80, macro = .80, κ = .74).
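As a rough illustration of the mechanism credited for SentiCR's advantage on Jira, the sketch below rebalances a skewed sentiment label distribution with SMOTE before training and then reports micro- and macro-averaged F-measures. It is a minimal example assuming a hypothetical feature matrix and the imbalanced-learn and scikit-learn libraries, not SentiCR's actual pipeline.

```python
# Illustrative sketch (not SentiCR's real implementation): SMOTE rebalancing
# of an imbalanced training set, then micro-/macro-F1 evaluation.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical feature vectors (e.g., TF-IDF of comments) with a skewed label
# distribution: mostly neutral (0), fewer positive (1) and negative (-1),
# which is typical of issue-tracker data such as Jira.
X = rng.normal(size=(1000, 20))
y = rng.choice([-1, 0, 1], size=1000, p=[0.10, 0.75, 0.15])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# SMOTE synthesizes minority-class examples so every class is equally frequent
# in the training set; the test set keeps its original distribution.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = GradientBoostingClassifier().fit(X_bal, y_bal)
pred = clf.predict(X_test)

print("micro-F1:", f1_score(y_test, pred, average="micro"))
print("macro-F1:", f1_score(y_test, pred, average="macro"))
```

Micro-averaging weights every document equally, so it is dominated by the majority (neutral) class, while macro-averaging weights each sentiment class equally; reporting both, as the paper does, exposes weaknesses on the rarer polar classes.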
Afterwards, they investigate to what extent the tools agree with each other. To perform this investigation, they run the sentiment analysis tools on the model-driven annotated datasets. In addition to Cohen's κ, they assess inter-rater agreement in terms of the percentage of cases in which the tools agree (perfect agreement), as well as the percentages of severe disagreement (positive vs. negative, and vice versa) and mild disagreement (positive/negative vs. neutral). They observe substantial to perfect agreement for all pairs of tools. In the worst case, only 3% disagreement is observed on the Stack Overflow dataset, while no disagreement is observed on the Jira dataset. Finally, they also compare the performance of the tools on the model-driven and ad-hoc annotated datasets, and observe a drop in performance on the ad-hoc annotations compared to the model-driven ones.
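The agreement analysis described above can be sketched as follows, assuming the predictions of two hypothetical tools are available as label arrays in {-1: negative, 0: neutral, 1: positive}; this is only an illustration of the metrics, not the authors' original script.

```python
# Minimal sketch of pairwise tool agreement: Cohen's kappa plus the
# perfect / severe / mild (dis)agreement breakdown described in the paper.
import numpy as np
from sklearn.metrics import cohen_kappa_score

tool_a = np.array([1, 0, -1, 0, 1, -1, 0, 0, 1, -1])   # e.g., Senti4SD labels
tool_b = np.array([1, 0, -1, 1, 1,  0, 0, 0, -1, -1])  # e.g., SentiCR labels

kappa = cohen_kappa_score(tool_a, tool_b)

perfect = np.mean(tool_a == tool_b)                          # identical labels
severe = np.mean(tool_a * tool_b == -1)                      # positive vs. negative
mild = np.mean((tool_a != tool_b) & (tool_a * tool_b == 0))  # polar vs. neutral

print(f"Cohen's kappa: {kappa:.2f}")
print(f"perfect: {perfect:.0%}, severe: {severe:.0%}, mild: {mild:.0%}")
```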
To gain deeper insight into the difficulties inherent to sentiment detection in software engineering, they manually examine the cases for which all three tools yield a wrong prediction. They categorize the errors into classes such as polar facts, general errors, subjectivity in annotation, and so on, and find that polar facts are responsible for the largest share of classification errors.
Contributions of The Paper
Provide useful insight into the performance of existing sentiment analysis tools
Build gold sets using both ad-hoc and model-driven approaches
Provide actionable insight by manually examining misclassified cases to identify the facets responsible for misclassification.
Publisher
International Conference on Mining Software Repositories
Link to The Paper
https://dl.acm.org/doi/10.1145/3196398.3196403
Name of The Authors
Nicole Novielli, Daniela Girardi, Filippo Lanubile
Year of Publication
2018
Comments
No response