SentimentAnalyzer on Amazon Reviews Dataset

chinglamchoi commented 4 years ago

I used the Sentiment Analyzer model to perform binary classification on the Amazon Reviews Dataset. Before training, I performed the following pre-processing steps:

truncate input at 500 chars
strip stopwords
strip corrupt utf8 chars (iso-8859-1 chars)
stemming to root words
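
For readers who want to try something similar, the steps above can be sketched roughly as follows in Python. This is a minimal illustration, not the code used for the reported numbers: the stopword list, the ASCII-only filter (a stand-in for stripping corrupt characters), the crude suffix stripper (a stand-in for a real stemmer) and the exact order of the middle steps are all assumptions.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "and", "or", "of", "to", "in", "it", "this", "i"}

MAX_CHARS = 500  # truncation length taken from the list above


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (assumed stand-in for 'corrupt utf8 chars')."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def crude_stem(word: str) -> str:
    """Very rough suffix stripping; this is NOT a real Porter stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review: str) -> List[str]:
    text = review[:MAX_CHARS]                            # 1. truncate at 500 chars
    text = strip_non_ascii(text)                         # 2. strip non-ASCII chars
    tokens = re.findall(r"[a-z']+", text.lower())        # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. strip stopwords
    return [crude_stem(t) for t in tokens]               # 4. stem


print(preprocess("This is the best product, I loved it!"))
```

In practice the crude stemmer would be swapped for a proper one; it is only here to keep the sketch free of dependencies.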

The inference results are as follows:

Accuracy: 49.64325%
Precision: 0.497469903015904
Recall: 0.701445
F1 Score: 0.5821059947510917
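
As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported here. The helper name `f1_from` is made up for this sketch.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the results reported in this comment.
precision = 0.497469903015904
recall = 0.701445

f1 = f1_from(precision, recall)
print(f"F1 = {f1:.6f}")
```

The recomputed value agrees with the reported F1 score to the printed precision.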

I also compared the accuracy of TextAnalysis' model (pretrained on the IMDB dataset) with that of a logistic regression model in sklearn (trained on 12000 reviews from the Amazon Reviews training set). The sklearn model scored 46.47175% accuracy.
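
A baseline of this kind might look roughly like the sketch below. It is not the original script: the TF-IDF features, the `max_iter` setting and the toy reviews are all assumptions made for illustration, and the real run would load the Amazon Reviews data instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for Amazon reviews (1 = positive, 0 = negative).
train_texts = [
    "great product works perfectly",
    "absolutely love it",
    "excellent quality and fast shipping",
    "terrible waste of money",
    "broke after one day",
    "very disappointed with this purchase",
]
train_labels = [1, 1, 1, 0, 0, 0]

test_texts = ["love the quality", "waste of money"]
test_labels = [1, 0]

# TF-IDF features feeding a logistic regression classifier (feature choice is assumed).
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(f"accuracy: {accuracy:.2%}")
```

Running both models on the same held-out split, with the same pre-processing, would make the two accuracy figures directly comparable.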

To improve Sentiment Analyzer's accuracy, I think part-of-speech tagging could be implemented. However, it is currently very time-consuming, taking up to 24 hours of pre-processing for 10000 reviews (the entire test set has 400000 samples), which made it infeasible to test during Google Code-in!

aviks commented 3 years ago

@chinglamchoi do you have the code for this exercise available somewhere?

chinglamchoi commented 3 years ago

Yes, here it is: https://github.com/chinglamchoi/GCI_With_Julia/tree/master/Machine_Learning/sentiment_analysis

Due to the time limits of GCI, I believe the accuracy values reported in this issue compare sklearn and TextAnalysis.jl at 4000 and <4000 epochs respectively (there was not enough time to train further). Sklearn achieved accuracies higher than 0.4647 with >4000 training samples.

JuliaText / TextAnalysis.jl

SentimentAnalyzer on Amazon Reviews Dataset #191