SamiaKabir / ChatGPT-Answers-to-SO-questions


Analysis of ChatGPT answers to SO questions

Q&A platforms have been an integral part of the web-help-seeking behavior of programmers over the past decade. However, with the recent introduction of ChatGPT, the paradigm of web-help-seeking behavior is experiencing a shift. Despite the popularity of ChatGPT, no comprehensive study has been conducted to evaluate the characteristics or usability of ChatGPT’s answers to software engineering questions. To bridge the gap, we conducted the first in-depth analysis of ChatGPT’s answers to 517 Stack Overflow (SO) questions and examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT’s answers. Furthermore, we conducted a large-scale linguistic analysis and a user study to understand the characteristics of ChatGPT answers from linguistic and human aspects. Our analysis shows that 52% of ChatGPT answers are incorrect and 77% are verbose. Nonetheless, ChatGPT answers are still preferred 39.34% of the time due to their comprehensiveness and well-articulated language style. Our results imply the necessity of close examination and rectification of errors in ChatGPT, while at the same time creating awareness among its users of the risks associated with seemingly correct ChatGPT answers.

Manual Analysis Data

  1. The ChatGPT answers to SO question folder has 18 files containing 517 ChatGPT answers (labeled by 2 labelers). Each file contains the question post ID, the title of and link to the original Stack Overflow question post, the ChatGPT answer labeled by the labelers, and overall feedback from the labelers. The labels for each answer mark different types of incorrectness, inconsistency, and conciseness issues at a fine-grained level. The overall feedback contains answer-level ratings, including comprehensiveness and usefulness. Only the comprehensiveness rating is reported in the paper [1].

    Each file is named in the following format: Annotations_time_popularity_type.docx. For example, Annotations_new_61_3.docx contains answers to SO questions that are new, not popular, and of the debugging type.
    Please refer to the paper [1] for the definition and criteria of each category.

    Type: 1 - Conceptual, 2 - How to, 3 - Debugging
    Popularity: 61 - Not Popular, 1500 - Avg. Popular, 5000 - Popular
    Time: Old - Before November 2022, New - After November 2022

  2. The Codebooks folder contains codebooks that were used to label the answers. The CodeBook for Non-Code Answer.pdf contains the codes and definitions for labeling sentences in non-code answers generated by ChatGPT. The CodeBook_code.pdf contains the codes and definitions for labeling the types of incorrectness in code examples embedded in ChatGPT-generated answers.
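The naming key above can be decoded programmatically when iterating over the annotation files. A minimal sketch, assuming the Annotations_time_popularity_type.docx convention described above (the function name and the expanded labels are illustrative):

```python
def parse_annotation_filename(filename):
    """Decode Annotations_<time>_<popularity>_<type>.docx into its categories."""
    # Strip the extension and split on underscores:
    # ["Annotations", time, popularity, type]
    time_key, pop_key, type_key = filename.removesuffix(".docx").split("_")[1:4]
    times = {"old": "Before November 2022", "new": "After November 2022"}
    popularity = {"61": "Not Popular", "1500": "Avg. Popular", "5000": "Popular"}
    types = {"1": "Conceptual", "2": "How to", "3": "Debugging"}
    return times[time_key.lower()], popularity[pop_key], types[type_key]
```

For example, `parse_annotation_filename("Annotations_new_61_3.docx")` yields the time, popularity, and question-type categories of that file.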

Linguistic Analysis Data

  1. The Linguistic Analysis Data folder contains ChatGPT and SO answers to 2000 Stack Overflow questions. The Id field in chatgpt_answers.csv is equivalent to the ParentId field in SO_answers.csv; both hold the Id of the Stack Overflow question.

  2. The LIWC folder contains the occurrence frequencies of psycholinguistic lexicons in all 2000 answers (both ChatGPT and SO) computed by LIWC 2015.

  3. The Sentiment Analysis folder contains the final sentiment label outputs for ChatGPT and SO answers as predicted by the Sentiment Analysis Model [2].
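Because the Id field in chatgpt_answers.csv matches the ParentId field in SO_answers.csv, the two files can be joined on the question id. A minimal sketch using only the standard library (the function name, and any column names other than Id and ParentId, are illustrative; a question may have several SO answers):

```python
import csv

def pair_answers(chatgpt_rows, so_rows):
    """Pair ChatGPT and SO answers that belong to the same SO question.

    chatgpt_rows / so_rows are iterables of dicts, e.g. from csv.DictReader.
    Returns {question_id: {"chatgpt": row, "so": [rows...]}}.
    """
    # Index ChatGPT answers by question Id.
    chatgpt = {row["Id"]: row for row in chatgpt_rows}
    pairs = {}
    # Attach each SO answer whose ParentId matches a ChatGPT-answered question.
    for row in so_rows:
        qid = row["ParentId"]
        if qid in chatgpt:
            pairs.setdefault(qid, {"chatgpt": chatgpt[qid], "so": []})
            pairs[qid]["so"].append(row)
    return pairs

# Usage with the repo's CSVs would look like:
# with open("chatgpt_answers.csv", newline="") as f1, \
#         open("SO_answers.csv", newline="") as f2:
#     pairs = pair_answers(csv.DictReader(f1), csv.DictReader(f2))
```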


1.
2. Cloudy1225/stackoverflow-roberta-base-sentiment