DeveloperLiberationFront / AffectAnalysisToolEvaluation

SEmotion'18 paper on evaluating the reliability of sentiment and politeness analysis tools

SEmotion'18 reviews - verbatim #1

Closed nasifimtiazohi closed 6 years ago

nasifimtiazohi commented 6 years ago

----------------------- REVIEW 1 ---------------------
PAPER: 9
TITLE: Sentiment and Politeness Analysis Tools on Developer Discussions Are Unreliable, but so Are People
AUTHORS: Nasif Imtiaz, Justin Middleton, Peter Girouard and Emerson Murphy-Hill

----------- Overall evaluation -----------
In this paper the authors studied the reliability of politeness and sentiment analysis tools when applied in the software engineering domain. They used 4 human raters rating 598 comments sampled from GitHub pull request comments. They concluded that the five sentiment analysis tools used and the politeness tool have poor agreement with human raters, and that even the human raters had poor agreement among themselves. The paper is well written and easy to follow; the motivation is on topic for the workshop and clearly stated. The reliability of affect analysis tools is an important topic in SE due to the lack of domain-specific tools. However, I have major concerns about the methodology adopted in this paper that led me to decide for rejection.

1 - The unreliability of sentiment tools in the SE domain has already been studied in the papers cited and also, for example, by Lin et al. in "Sentiment Analysis for Software Engineering: How Far Can We Go?", where the authors manually rated 40k comments. With respect to other studies about sentiment analysis tools, they only add a new tool, Senti4SD. Hence, regarding sentiment analysis tools the paper does not add a relevant contribution. (#17)

2 - In section 2.3 the authors state "We choose GitHub for our research setting as it is the largest code host and the comments are more representative of developers’ discussion in general." It is not clear to me why GitHub is more representative of developers' discussions than other platforms where developers discuss, such as Jira, Stack Overflow, or Reddit, for example. (#8)

3 - In section 3.4 the authors state that "It was trained and tested on different corpora and hence has the claim to be domain independent. It returns a politeness score between 0 to 1 for each texts with 1 being the most polite." This is not correct, since the tool returns a polite/impolite label along with a confidence level for the given label that ranges from 0 to 1 (see the sketch after this list). (#11)

4 - Another concern is about the rating process: among the four raters, Coder 1 rated all comments and later discussed with the other raters to resolve conflicts on rated comments. This is a "human bias" because Coder 1 rated all comments in the dataset and may have influenced the other raters. (#12)

5 - The politeness tool is trained to classify "question and answer" text containing exactly 2 sentences, not general text. Providing text not in this form may yield biased results. The tool also provides a utility to classify whether a text is in the "question and answer" form, which can be used to filter out comments not in that form. (#11)

6 - The politeness tool was trained with 10k requests rated by about 400 raters who had been carefully chosen using Amazon Mechanical Turk, and each request was annotated by 5 raters. In this study the authors rated 598 comments (not requests) and each comment was rated by 2 raters. This leads to unreliable results in my view, even though I do agree on the need for a reliable evaluation of the politeness tool in SE. (#11)

7 - In threats to validity the authors state that "Finally, while we randomly picked 589 comments, they might not be representative of the whole GitHub community." The GHTorrent dataset hosts tens of millions of developers' comments; 598 sampled comments are not representative at all. (#3)
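
Regarding points 3 and 5 above, here is a minimal sketch of how the tool's label-plus-confidence output could be mapped onto the 0-to-1 politeness scale the paper describes, and how non-request-form comments could be filtered out first. The `classify_politeness` and `is_request_form` callables are hypothetical placeholders for whatever interface the politeness tool actually exposes, not its real API:

```python
from typing import Callable, Optional

# Hypothetical adapter around the politeness tool, assuming it returns a
# polite/impolite label plus a confidence in [0, 1] as described in point 3.
# The callable names below are placeholders, not the tool's real API.

def label_to_score(label: str, confidence: float) -> float:
    """Map a (label, confidence) pair to a single politeness score in [0, 1]."""
    if label == "polite":
        return confidence        # confident "polite"   -> close to 1
    return 1.0 - confidence      # confident "impolite" -> close to 0

def score_comment(comment: str,
                  classify_politeness: Callable[[str], tuple],
                  is_request_form: Callable[[str], bool]) -> Optional[float]:
    """Score one comment, skipping text the tool was not trained for (point 5)."""
    if not is_request_form(comment):  # tool expects 2-sentence, request-style text
        return None                   # filter out rather than produce a biased score
    label, confidence = classify_politeness(comment)
    return label_to_score(label, confidence)
```

With an explicit mapping like this, the paper's 0-to-1 description and the tool's label-plus-confidence output can be reconciled, but the mapping should be stated in the paper.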

----------------------- REVIEW 2 ---------------------
PAPER: 9
TITLE: Sentiment and Politeness Analysis Tools on Developer Discussions Are Unreliable, but so Are People
AUTHORS: Nasif Imtiaz, Justin Middleton, Peter Girouard and Emerson Murphy-Hill

----------- Overall evaluation -----------
The paper presents a study on sentiment and politeness analysis on GitHub comments from pull requests and issues. A dataset of 589 manually rated GitHub comments is presented and compared to existing sentiment and politeness tools. A coding scheme is also developed to quantify politeness. The results show that the tools do not agree with each other, and they also do not agree with human ratings.

Overall, the paper is a good fit for SEmotions. I am sure it will generate some nice discussion at the workshop. Results seem to corroborate prior findings. What is new is the set of hand-coded comments, which could be valuable. Compared to prior work, this paper takes a different approach in that it uses GitHub comments instead of commits. They correctly state that Jongeling did something similar with Jira. The authors also point out that coming up with a politeness coding scheme is easier than coming up with a sentiment coding scheme.

I have listed below some questions/comments about the paper to help with your camera-ready version.

Do the authors plan to make the dataset itself public? (#13)

The research questions just seem to be presented abruptly in the intro to Section 3 right before 3.1. Perhaps reconsider this presentation. These can also go into the Introduction section. (#18)

I wonder if the authors thought about how the tools might work if the text of the URLs was also processed. I mean, the raters were provided with these URLs, so it could be that they used them. How can we know for certain? (#10)

Give some indication of which projects the 589 comments were taken from. Were they actual software projects? Some more information about them would help. Could you add a sentence or two? (#3)

How does Table 1 relate to the scores of -2 to 2? I was lost on this one. (#8)

What about SentiCR? They also claim to have an SE focus. Why was that not chosen? (#15)

In your tool selection section, some tools have the author of the tool mentioned while others don't. Perhaps state authors for all in this case. (#4)

Table 2's caption should state what those numbers are, i.e., kappa. (#9)

Is a coding scheme presented for quantifying sentiment? Why was one presented for politeness only? (#8)

Just curious if Coder 1 is one of the paper's authors? (#14)

I didn't quite understand what the following sentence meant: "However, Coder 1’s agreement on politeness is higher than sentiment with both the other coders. One reason behind this can be that the annotation scheme for politeness was more detailed with relevant examples." (#8)
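
The agreement numbers discussed here appear to be kappa values (see the earlier comment on Table 2's caption). A minimal sketch of how such pairwise coder agreement can be computed; the ratings below are made-up placeholders, not the paper's data:

```python
# Cohen's kappa between two coders' sentiment labels (illustrative data only).
from sklearn.metrics import cohen_kappa_score

coder_1 = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
coder_2 = ["positive", "neutral", "negative", "negative", "neutral", "neutral"]

# Kappa corrects raw agreement for the agreement expected by chance:
# 1.0 means perfect agreement, 0.0 means chance-level agreement.
kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")
```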

In Section 4.3, RQ3, you mention "our politeness tool". This implies that you created it. Please rephrase. (#2)

It is interesting to see that Senti4SD has even more neutral items than SentiStrength. Whether or not they are the same items is left to be seen. (#7)

Developers use emoticons or emojis in their comments. How did you deal with those? Were they stripped out? (#6)
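
If stripping them out was the choice, here is a minimal sketch of one way to remove emoji and simple ASCII emoticons before feeding comments to the tools; the emoticon pattern and emoji ranges are illustrative, not the paper's actual preprocessing:

```python
import re

# Illustrative preprocessing only; not the paper's actual pipeline.
EMOTICON_RE = re.compile(r"[:;=][\-^']?[)(DPpOo]")  # e.g. :) ;-) :D :P
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji/symbol blocks

def strip_affect_symbols(comment: str) -> str:
    """Remove emoji and simple ASCII emoticons from a comment."""
    comment = EMOTICON_RE.sub(" ", comment)
    comment = EMOJI_RE.sub(" ", comment)
    return re.sub(r"\s+", " ", comment).strip()

print(strip_affect_symbols("Looks good to me :) thanks! 🎉"))  # -> "Looks good to me thanks!"
```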

The paper is missing the following citations: (#16)

Vinayak Sinha, Alina Lazar, Bonita Sharif: Analyzing developer sentiment in commit logs. MSR 2016: 520-523

Felipe Ebert, Fernando Castor, Nicole Novielli, Alexander Serebrenik: Confusion Detection in Code Reviews. ICSME 2017: 549-553

Daviti Gachechiladze, Filippo Lanubile, Nicole Novielli, Alexander Serebrenik: Anger and Its Direction in Collaborative Software Development. ICSE-NIER 2017: 11-14

Alexander Serebrenik: Emotional Labor of Software Engineers. BENEVOL 2017: 1-6

Minor (#2)

----------------------- REVIEW 3 ---------------------
PAPER: 9
TITLE: Sentiment and Politeness Analysis Tools on Developer Discussions Are Unreliable, but so Are People
AUTHORS: Nasif Imtiaz, Justin Middleton, Peter Girouard and Emerson Murphy-Hill

----------- Overall evaluation -----------
The authors describe a study that examined the reliability of different analysis tools for detecting the sentiment polarity and politeness expressed in GitHub comments. The authors compared the analysis tools' results against a manually coded set of 589 comments. The authors found that, overall, the tools are not reliable for detecting sentiment and politeness, and that the agreement rate between human coders was relatively low due to the subjectivity of the task.

This work adds to ongoing research pinpointing the weaknesses of automated analysis techniques for detecting affect in software engineering artifacts. The main difference with previous work is the chosen artifact. It is well written and structured, and makes for a very enjoyable read. I believe that it could contribute to interesting discussions during the workshop. I particularly liked the description of the rationale behind the decisions made in the study design.

Points for improvement:

In the description of the tools the authors should briefly describe whether the tools are dictionary-based or whether they use a type of supervised model; for the latter, a short mention of the type of artifacts on which the tools were trained would be useful to understand the differences in the application/training domain. (#4)

A short description of possible future work would strengthen the paper. (#5)

It is unclear if the comments selected came from a random selection of all projects available in GHTorrent or from specific projects. (#3)
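
On this sampling point (and the related question in Review 2 about which projects the 589 comments came from), here is a minimal sketch of drawing a random sample across all projects and reporting its project spread; the data below is a placeholder, not the actual GHTorrent query:

```python
import random
from collections import Counter

# Placeholder rows; in the study these would be (project, comment) pairs
# pulled from GHTorrent for pull request and issue comments.
comments = [
    ("owner_a/repo_x", "Looks good, merging."),
    ("owner_b/repo_y", "Please add a test for this case."),
    ("owner_c/repo_z", "Why was this changed?"),
]

random.seed(42)                        # fixed seed makes the sample reproducible
sample_size = min(589, len(comments))  # the paper samples 589 comments
sample = random.sample(comments, sample_size)

# Counting sampled comments per project shows whether the sample is spread
# across many projects or concentrated in a few.
per_project = Counter(project for project, _ in sample)
for project, count in per_project.most_common(10):
    print(f"{project}: {count}")
```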

CaptainEmerson commented 6 years ago

Missing issues for the following suggestions:

Hence, regarding sentiment analysis tools the paper does not add a relevant contribution.

Do the authors plan to make the dataset itself public?

The research questions just seem to be presented abruptly in the intro to Section 3 right before 3.1. Perhaps reconsider this presentation. These can also go into the Introduction section.

What about SentiCR? They also claim to have an SE focus. Why was that not chosen?

Just curious if Coder 1 is one of the paper's authors?

The paper is missing the following citations

You don't have to fix things that you disagree with, but you need to provide a rationale in the issue. Furthermore, if the reviewer is confused or wrong, you need to figure out how to make future readers not be confused or wrong.

nasifimtiazohi commented 6 years ago

@CaptainEmerson, I am still adding issues and I will address all of them in this repo. I am opening the issues, addressing them, and closing them one by one.

nasifimtiazohi commented 6 years ago

All of them have issues now and are being addressed.