Build a large-scale dataset of sentiment and emotion on code review comments

The initial attempt will use the Sourced dataset: https://github.com/src-d/datasets/tree/master/ReviewComments

The overall process is: first, pre-process the comments (e.g. remove HTML tags and code snippets); second, adapt the text (review comments) to the constraints of the sentiment/emotion analysis API in use (e.g. check comment length); third, make the API calls and store the results.
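The three steps above could be sketched roughly as follows. This is only a minimal sketch: the regular expressions are a crude approximation of "remove HTML tags and code snippets", `MAX_CHARS` is a made-up placeholder for whatever length limit the chosen API has, and `call_api` is a hypothetical callable standing in for the real sentiment/emotion API client, which has not been chosen yet.

```python
import html
import re

# Hypothetical length limit; the real value depends on the API we pick.
MAX_CHARS = 5000

FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)            # markdown code fences
HTML_CODE = re.compile(r"<(pre|code)\b[^>]*>.*?</\1>",       # <pre>/<code> blocks
                       re.DOTALL | re.IGNORECASE)
INLINE_CODE = re.compile(r"`[^`\n]+`")                       # `inline` code
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")                  # any remaining tag


def preprocess(comment):
    """Step 1: drop code snippets and HTML tags, then normalize whitespace."""
    text = FENCED_CODE.sub(" ", comment)
    text = HTML_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text)  # turn &amp; etc. back into plain characters
    return " ".join(text.split())


def fits_api(text, max_chars=MAX_CHARS):
    """Step 2: keep only non-empty comments within the assumed length limit."""
    return 0 < len(text) <= max_chars


def analyze_all(comments, call_api, max_chars=MAX_CHARS):
    """Step 3: call the API for each usable comment and collect the results."""
    results = []
    for comment in comments:
        text = preprocess(comment)
        if not fits_api(text, max_chars):
            continue  # skipped: empty after cleaning, or too long
        results.append({"text": text, "result": call_api(text)})
    return results
```

Passing `call_api` in as a parameter keeps the cleaning and filtering logic independent of the API, so we can swap services or test with a stub. Storing the results (e.g. as JSON lines or in a database) is left out here.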

For the dataset, plan B is the GH API: we would write a script that consumes the GH API to gather code review comments and their metadata, and then follow the process listed above. Another option is GH Archive, although that is the source already used by Sourced.
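A plan-B script could look something like the sketch below. It assumes the GitHub REST endpoint that lists review comments for a repository (`GET /repos/{owner}/{repo}/pulls/comments`) with page-based pagination. The function names, the `User-Agent` string, the selected metadata fields and the `max_pages` cap are illustrative choices, not decisions we have made.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def http_get_json(url, token=None):
    """Fetch a URL from the GH API and decode the JSON body."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "code-review-analysis-sketch",  # placeholder name
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def review_comments(owner, repo, fetch=http_get_json, per_page=100, max_pages=10):
    """Yield review comments plus a few metadata fields, page by page."""
    for page in range(1, max_pages + 1):
        url = (f"{API_ROOT}/repos/{owner}/{repo}/pulls/comments"
               f"?per_page={per_page}&page={page}")
        batch = fetch(url)
        if not batch:
            break  # an empty page means there is nothing left
        for item in batch:
            yield {
                "id": item.get("id"),
                "body": item.get("body", ""),
                "path": item.get("path"),
                # user can be null, e.g. for deleted accounts
                "author": (item.get("user") or {}).get("login"),
                "created_at": item.get("created_at"),
                "pull_request_url": item.get("pull_request_url"),
            }
```

The `fetch` parameter is there so the pagination logic can be exercised with a fake fetcher, without network access or a token. A real run would also need to respect the API rate limits, which this sketch ignores.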

SentimentAnalysisInSE / code_review_analysis, issue #11