Name of The Authors
Akalanka Galappaththi, Sarah Nadi, Christoph Treude
Year of Publication
31 Mar 2022
Summary
Locating the piece of information that is meaningful for a developer's task on Stack Overflow
can be challenging, especially in long threads, given the abundance of available knowledge. This
research focuses on discovering additional contexts on Stack Overflow that may serve as navigational
signals for Stack Overflow users. The authors interpret context as information that specifies the
technologies and assumptions behind a specific question or answer. Additional context appears
in a thread's answers or comments but does not overlap with the context of the question. After
conducting a quantitative and qualitative empirical investigation, the authors discuss how their
findings could be used to enhance Stack Overflow threads with navigational signals based on
additional contexts.
Contributions of The Paper
A novel empirical study focuses on finding additional contexts on Stack Overflow
The paper is well structured, and I enjoyed reading it
Adequate figures and diagrams make the flow easier to follow
Using recent data from Stack Overflow reduces the risk of concept drift
Relying on three annotators' opinions, rather than a large sample set, to generate the results
Comments
Although the authors claim that no existing research focuses explicitly on finding additional contexts on Stack
Overflow, several related studies share the same motivation in this area, and the related work section
does not provide sufficient detail on that literature. For example, the authors apply a couple of existing
techniques, such as open and closed card sorting and the Witt taxonomy, yet no details about these
techniques are given in the study.
In this research, the authors conducted an empirical investigation to determine the frequency and significance
of technical context in Stack Overflow answers and comments, using tags as a proxy for technical context;
however, the motivating rationale for the study is thin. What is the immediate benefit of this study
to Stack Overflow users? To what degree are the results likely to improve existing navigational cueing
techniques? For an automated process, how does additional context differ from generic tags? These questions
are not addressed explicitly.
With over 21 million questions on Stack Overflow, the authors narrow the study down to questions carrying
one of three tags: JSON, regex, or Django. They state that they picked these three tags for diversity, but how
exactly were they chosen? What kind of diversity is implied, given that all three are functionally distinct?
What criteria were used to select each topic area? For example, a question might concern a
programming language, a framework/API, or a specific function. Further information on this setting
is essential for a deeper understanding of the study.
In Section 3.1.2, the authors eliminate 99 percent of the categories, namely those that are rarely used to
describe tags and are not meaningful. What criteria did they use to establish significance? Was it manual
examination or automated tooling? They find that their chosen tags capture tags that indicate technologies
while not capturing meaningless tags in sentences as technical contexts. What led them to this
conclusion, and what method was employed to discover this insight?
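One plausible reading of the elimination step (a hypothetical sketch, since the paper's exact criterion is what the question above asks about) is a simple frequency cutoff that keeps only the most-used categories and drops the long tail:

```python
from collections import Counter

def keep_frequent_categories(category_usage, keep_fraction=0.01):
    """Keep only the most-used categories, dropping the rare long tail.

    category_usage: list of category labels, one per tag-category assignment.
    Returns the top `keep_fraction` of distinct categories by usage count.
    NOTE: this is a guessed reconstruction of the paper's "99 percent
    eliminated" step, not the authors' actual procedure.
    """
    counts = Counter(category_usage)
    k = max(1, int(len(counts) * keep_fraction))
    return [cat for cat, _ in counts.most_common(k)]

# Toy data: one dominant category and a rare tail
usage = ["library"] * 50 + ["framework"] * 30 + ["os"] * 2 + ["misc"]
print(keep_frequent_categories(usage, keep_fraction=0.25))  # → ['library']
```

Whether the paper's cutoff was count-based, percentile-based, or manual is exactly what remains unclear; a sketch like this shows how sensitive the surviving category set is to that choice.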
In "Identify sentences with additional context", the authors filter out sentences whose detected context
overlaps with the question's context, based on the previously identified answer and comment sentences with
technical context. How does the filtering handle technical context expressed as bigrams or trigrams?
How is the case handled where one or two words of a phrase match the question's context but the rest do not?
What procedure would filter out such cases?
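The ambiguity raised above can be made concrete with a naive overlap filter (a hypothetical sketch, not the authors' actual implementation; the tag set and matching rule are invented): with exact tag matching, a bigram tag such as `google-chrome` is not partially matched by the question tag `chrome`, and it is unclear whether such near-overlaps should be filtered.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with '-'."""
    return {"-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def detected_contexts(sentence, known_tags, max_n=3):
    """Tags (uni-, bi-, or trigrams) mentioned in a sentence."""
    tokens = sentence.lower().split()
    found = set()
    for n in range(1, max_n + 1):
        found |= ngrams(tokens, n) & known_tags
    return found

def has_additional_context(sentence, question_tags, known_tags):
    """Keep a sentence only if it mentions a tag NOT already on the question.

    Ambiguity: a bigram tag like 'google-chrome' partially overlaps a
    hypothetical unigram question tag 'chrome'. This sketch counts exact
    tag matches only, which is one of several defensible choices.
    """
    return bool(detected_contexts(sentence, known_tags) - question_tags)

known = {"json", "regex", "django", "google-chrome", "python"}
q_tags = {"json", "python"}
# 'google chrome' matches the bigram tag, which is absent from the question
print(has_additional_context("this fails in google chrome with json", q_tags, known))  # → True
print(has_additional_context("just use json here", q_tags, known))  # → False
```

Under partial-match rules instead of exact matching, the first sentence's status would flip depending on how the overlap is scored, which is why the review asks what procedure the authors used.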
Who are the annotators? What factors were considered when choosing the three
annotators? Do the annotators differ in qualifications or other attributes? Since most of the
findings come from the annotators' insights and responses, some elaboration would be
welcome.
In "Distributing threads between annotators", why is Fleiss' kappa used to measure the
ambiguity of the coding task? No reasoning behind this choice is provided. Was there a process for
discussing discrepancies among annotators? The authors used the findings from their discussion to
reduce ambiguity in the coding guide, but how were those ambiguities removed? This section
lacks detail.
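For reference, Fleiss' kappa extends Cohen's kappa from two raters to an arbitrary fixed number of raters, which is presumably why it fits a three-annotator setup. A minimal sketch of the computation (the ratings matrix below is invented toy data, not the paper's):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)
    n = sum(counts[0])  # raters per subject
    # Per-subject observed agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Chance agreement from marginal category proportions
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 3 annotators, 4 threads, 2 codes (additional context: yes/no) -- toy data
ratings = [[3, 0], [2, 1], [3, 0], [0, 3]]
print(round(fleiss_kappa(ratings), 3))  # → 0.625
```

A kappa near 0 means agreement is no better than chance, while values near 1 indicate strong agreement; the review's question is why this particular statistic, rather than, say, Krippendorff's alpha, was chosen.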
The three annotators begin by annotating 30% of the threads in the sample to find additional context; the
remaining threads are then distributed among the authors and annotated separately. When two distinct
strategies are used to achieve the same goal, the outcome may be skewed.
In Section 3.3.1, "Identifying categories of additional context", the authors state that they did not use the
tag categories in Witt for this task because each tag has several alternative categories. But what if more
than one category for a tag makes sense and contains valuable information?
In Section 3.3.2, "Identifying reasons for mentioning additional context", the authors observed poor
annotator agreement during the pilot round and enhanced the coding guide with examples of how to apply
each code. How did that procedure go, and what was the low kappa score for annotator agreement?
Publisher
Mining Software Repositories (MSR '22)
Link to The Paper
https://arxiv.org/abs/2204.00110