MIT License

pySenti4SD

Python implementation of Senti4SD. Senti4SD is an emotion polarity classifier specifically trained to support sentiment analysis in developers' communication channels. It is trained and evaluated on a gold standard of over 4K posts extracted from Stack Overflow, and it is part of the Collab Emotion Mining Toolkit (EMTk).

Fair Use Policy

Please cite the following paper if you intend to use our tool for your own research:

Calefato, F., Lanubile, F., Maiorano, F., Novielli, N. (2018) "Sentiment Polarity Detection for Software Development," Empirical Software Engineering, 23(3), pp. 1352-1382, doi: https://doi.org/10.1007/s10664-017-9546-9.

How do I get set up?

Installation

NOTE: You will need to install dvc to check out this project. Once it is installed and initialized, simply run the following:

git clone https://github.com/collab-uniba/pySenti4SD.git
cd pySenti4SD
dvc pull -r origin
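
Before cloning, it can help to confirm that both tools used above are available on your PATH. The following is a minimal Python sketch and is not part of the repository; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("git", "dvc")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install: " + ", ".join(missing))
    else:
        print("git and dvc are available.")
```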

Requirements

Usage

In the following, we show first how to train a new model for polarity classification and, then, how to test the model on unseen data.
For testing purposes, you can use the Sample.csv input file available in the root of the repo.

Train a new classification model

sh train.sh -i train.csv [-d csv_delimiter] [-g] [-c chunk-size] [-j jobs-number] [-o model-name]

Alternatively, you can run the script with two separate datasets, one for training and the other for testing:

sh train.sh -i train.csv -i test.csv [-d csv_delimiter] [-g] [-c chunk-size] [-j jobs-number] [-o model-name]

where

As a result, the script will generate the following output files:

Classification task

sh classification.sh -i dataset.csv [-d csv_delimiter] [-g] [-t] [-m model-name] [-c chunk-size] [-j jobs-number] [-o predictions.csv]

where

As a result, the script will create a predictions-file-name.csv file inside the predictions folder containing:

  Polarity 
  …  
  positive
  negative
  …

Or, for example, if the input dataset contains a column named "ID" and the -t parameter is used, predictions-file-name.csv will look like this:

  ID,Text,Polarity 
  …  
  21,"""@DrabJay: excellent suggestion! Code changed. :-)""",positive
  22,"""@IgnacioOcampo, I gave up after a while I am afraid :(""",negative
  …
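
A predictions file in either layout shown above can be summarized with a few lines of Python. This is a minimal sketch that only assumes a header row with a "Polarity" column, as in the samples; the helper name `count_polarities` is hypothetical and not part of pySenti4SD, and the inline data simply reuses the two example rows.

```python
import csv
import io
from collections import Counter


def count_polarities(lines, delimiter=","):
    """Count the labels found in the Polarity column of a predictions file."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    # Strip header names so stray spaces around column names are tolerated.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return Counter(row["Polarity"].strip() for row in reader)


# Inline copy of the example rows; a real run would pass an open file instead.
sample = io.StringIO(
    'ID,Text,Polarity\n'
    '21,"""@DrabJay: excellent suggestion! Code changed. :-)""",positive\n'
    '22,"""@IgnacioOcampo, I gave up after a while I am afraid :(""",negative\n'
)
counts = count_polarities(sample)
print(dict(counts))
```

To use it on a real output, open the generated file from the predictions folder with `open(path, newline="")` and pass the file object to `count_polarities`.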

For example, to detect the polarity of the documents in the input file Sample.csv, run:

sh classification.sh -i Sample.csv -d sc
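
If you prefer to launch the classification from a Python script, the same command can be assembled as an argument list. The sketch below is an illustration under assumptions: `build_classification_command` is a hypothetical helper, it only covers the -i, -d and -o options from the usage line above, and it assumes the script is run from the root of the repo.

```python
import subprocess


def build_classification_command(input_csv, delimiter=None, output=None):
    """Assemble the argument list for classification.sh (hypothetical helper)."""
    command = ["sh", "classification.sh", "-i", input_csv]
    if delimiter is not None:
        command += ["-d", delimiter]
    if output is not None:
        command += ["-o", output]
    return command


command = build_classification_command("Sample.csv", delimiter="sc")
print(" ".join(command))  # prints: sh classification.sh -i Sample.csv -d sc

# To actually run it from the repository root, uncomment the next line:
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.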