SentimentAnalysisInSE / code_review_analysis


Text preprocessing #13

Open bcdasilv opened 5 years ago

bcdasilv commented 5 years ago

The overall process has three steps: first, preprocess the comments (e.g. remove HTML tags and code snippets); second, adapt the text (review comments) to the constraints of the sentiment/emotion analysis API in use (e.g. check the comment length); third, make the API calls and store the results.

In this issue, you're supposed to work on the preprocessing step. Skim a few code review comments to see what they look like. They probably contain a lot of noise for the sentiment/emotion analysis tools we use, for instance HTML/Markdown tags and code snippets.
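As a starting point, here is a minimal sketch of what the cleaning could look like in Python. The function name `clean_comment` and the regular expressions are assumptions for illustration, not something already in the repo; regexes will miss edge cases (nested tags, tilde fences, indented code blocks), so check the output against real comments from the dataset.

```python
import html
import re

# Markdown fenced code blocks (three backticks ... three backticks).
FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
# HTML <pre>...</pre> and <code>...</code> elements, including their content.
HTML_CODE = re.compile(r"<(pre|code)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
# Inline code spans such as `foo()`.
INLINE_CODE = re.compile(r"`[^`\n]+`")
# Markdown links and images: keep the link text, drop the URL.
MD_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")
# Any remaining HTML tag (the text between tags is kept).
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
# Heading / blockquote markers at line start, plus * and ~~ emphasis markers.
MD_MARKUP = re.compile(r"^[ \t]{0,3}(?:#{1,6}|>+)[ \t]*|\*{1,3}|~~", re.MULTILINE)
WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Return the comment with code, HTML tags, and Markdown markup removed."""
    text = FENCED_CODE.sub(" ", text)   # drop code first so its content is not kept
    text = HTML_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = MD_LINK.sub(r"\1", text)
    text = HTML_TAG.sub(" ", text)
    text = MD_MARKUP.sub("", text)
    text = html.unescape(text)          # &amp; -> &, &lt; -> <, ...
    return WHITESPACE.sub(" ", text).strip()


sample = "<p>Looks **good**, but `foo()` is slow.</p> See [docs](http://example.com) &amp; fix."
print(clean_comment(sample))
# prints: Looks good, but is slow. See docs & fix.
```

Code is removed before tags so that the content of `<pre>`/`<code>` elements is dropped rather than kept as plain text. Whether inline code should be removed or kept (identifiers sometimes carry meaning in a review comment) is an open choice.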

Then, write a script that reads the code review comments from the dataset, preprocesses them, and generates a new version of the comments that is ready for the sentiment/emotion analysis.

Make sure to store the processed comments.
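A possible shape for that script, as a sketch only: it assumes the dataset is a CSV file with a header row and a text column named `comment`, which may not match the actual dataset. The names `preprocess_file` and `comment_clean` are hypothetical. The original text is kept next to the cleaned text so the results can be traced back.

```python
import csv
from pathlib import Path
from typing import Callable


def preprocess_file(src: Path, dst: Path, clean: Callable[[str], str],
                    column: str = "comment") -> int:
    """Copy src to dst, adding a '<column>_clean' field. Returns rows written."""
    out_column = f"{column}_clean"
    with src.open(newline="", encoding="utf-8") as fin, \
         dst.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        header = list(reader.fieldnames or [])
        if column not in header:
            raise ValueError(f"column {column!r} not found in {src}")
        # extrasaction="ignore" skips stray fields from malformed rows.
        writer = csv.DictWriter(fout, fieldnames=header + [out_column],
                                extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            # Missing values come back as None or ""; treat both as empty.
            row[out_column] = clean(row[column] or "")
            writer.writerow(row)
            count += 1
    return count
```

Usage would be something like `preprocess_file(Path("comments.csv"), Path("comments_clean.csv"), clean=my_cleaning_function)`, where the file names and the cleaning function are placeholders.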

This preprocessing step might vary with the sentiment/emotion analysis approach in use. For instance, EMTk may remove HTML tags internally whereas IBM Tone Analyzer may not. This is something to be clarified as we work on the issue.
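One way to keep the step flexible until that is clarified is to define the cleaning as small functions and pick a list of them per tool. The sketch below is illustrative: the profile names and the steps assigned to each tool are placeholders, not verified behavior of EMTk or IBM Tone Analyzer.

```python
import re
from typing import Callable, Dict, List

Step = Callable[[str], str]


def strip_html_tags(text: str) -> str:
    return re.sub(r"</?[a-zA-Z][^>]*>", " ", text)


def strip_inline_code(text: str) -> str:
    return re.sub(r"`[^`\n]+`", " ", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical profiles: which steps each tool really needs is still unverified.
PROFILES: Dict[str, List[Step]] = {
    "emtk": [strip_inline_code, collapse_whitespace],
    "tone_analyzer": [strip_html_tags, strip_inline_code, collapse_whitespace],
}


def preprocess_for(tool: str, text: str) -> str:
    """Apply the steps configured for `tool`, in order."""
    for step in PROFILES[tool]:
        text = step(text)
    return text
```

With this layout, changing what a tool receives means editing one list in `PROFILES` rather than the cleaning code itself.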