Andyccs closed this issue 8 years ago.
Address 1st problem.
We initially planned to use the text-processing API to label all our data, which is why I thought we only needed the label, without the probability, and decided that `formatter.py` was not required. It turned out later that we only use the text-processing API to label some data as our training data. I have modified `classify.py` in cbdac0ed43ab3a04aefefd126e38f11429c76aa4 to output the correct `espn_data_result.json` that we need. In a later commit, we should use the trained model to label all tweets.
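For that later commit, here is a minimal sketch of how the trained model could label every tweet and write `espn_data_result.json`. The field names and the `model` interface below are assumptions for illustration, not the current `classify.py` code.

```python
# Hypothetical sketch: label every tweet with the trained model and write the
# results to espn_data_result.json. Field names and the model interface are
# assumptions; classify.py may structure this differently.
import json

def label_all_tweets(model, tweets, path="espn_data_result.json"):
    labels = model.predict(tweets)  # trained scikit-learn-style classifier assumed
    results = [{"tweet": text, "label": str(label)} for text, label in zip(tweets, labels)]
    with open(path, "w") as out:
        json.dump(results, out, indent=2)
```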
Address 4th problem. As discussed offline, we will have `label_1.csv` and `label_2.csv`.

Address 5th problem. Resolved offline.
Thanks a lot for the enhancements. Good to note that some tweets are now empty strings after preprocessing.
Regarding problem 7:

> 7 (updated 26 March 2016): I did some analysis on the training data. We have 2002 neutral, 625 negative, and 583 positive examples. Let's say I have a stupid classifier that classifies everything as neutral; then we get an accuracy of 62.36%. If you look at any classifier we have now, the accuracy is around this value.

We can use the F1-measure to indicate the 'best' performing classifier. I will do a comparison in the report.
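To make the point concrete, here is a quick check of the majority-class baseline using the counts quoted above. It shows why macro F1 is more informative than raw accuracy on this imbalanced dataset; the numbers in the comments follow directly from the counts.

```python
# Majority-class baseline from the counts above: accuracy looks decent,
# but macro F1 exposes that two of the three classes are never predicted.
counts = {"neutral": 2002, "negative": 625, "positive": 583}
total = sum(counts.values())

accuracy = counts["neutral"] / total            # ~0.6237
precision_neutral = counts["neutral"] / total   # every tweet is predicted neutral
recall_neutral = 1.0
f1_neutral = 2 * precision_neutral * recall_neutral / (precision_neutral + recall_neutral)
macro_f1 = (f1_neutral + 0.0 + 0.0) / 3         # negative and positive get F1 = 0

print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")  # 0.6237, 0.2561
```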
I am not sure whether we can justify it in this way. If I use a ZeroR classifier (a classifier that just classifies everything as one class), I get the following result.

Compare it to our tfidf_linear_svc:

Which one do you think is better? Apparently ZeroR, right?

The best way to solve this problem is to use a better classifier, or better feature extraction methods.
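A hedged sketch of how that comparison could be run on cross-validated macro F1 rather than accuracy. The pipeline below is only an assumption about what tfidf_linear_svc does (TF-IDF features into a linear SVM), not the repository code; `tweets` and `labels` are placeholders for the training data.

```python
# Compare a ZeroR-style baseline against a TF-IDF + LinearSVC pipeline using
# 5-fold cross-validated macro F1 instead of accuracy.
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_models(tweets, labels):
    models = {
        "ZeroR": make_pipeline(TfidfVectorizer(), DummyClassifier(strategy="most_frequent")),
        "tfidf_linear_svc": make_pipeline(TfidfVectorizer(), LinearSVC()),
    }
    for name, model in models.items():
        scores = cross_val_score(model, tweets, labels, cv=5, scoring="f1_macro")
        print(f"{name}: macro F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```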
I did a literature review of the research paper Agarwal, Apoorv et al., "Sentiment analysis of Twitter data," Proceedings of the Workshop on Languages in Social Media, 23 Jun. 2011: 30-38, where they report state-of-the-art results. Interestingly, the authors point out that Twitter-specific features (emoticons, hashtags, etc.) add value to the classifier, but only marginally; in our case, we simply remove most of these features. The difference between our methods is mostly in the preprocessing and feature engineering steps: they propose POS-specific prior polarity features and a tree kernel for feature engineering, while we use tf-idf. On reflection, tf-idf might not be a good method in this context, as it is better suited to larger documents. The classifier used in their experiments, a support vector machine (SVM) with 5-fold cross-validation, is similar to what we tried. Their best binary classification result looks promising (accuracy = 75.39, F1-pos = 74.81, F1-neg = 75.86), while their best 3-way classification result is poorer (accuracy = 60.50, F1-pos = 59.41, F1-neu = 60.15, F1-neg = 61.86). I listed the key steps to reproduce the results in our Google Drive folder.
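Since we currently strip those Twitter-specific signals out, one low-effort experiment would be to re-inject a few simple counts alongside TF-IDF. The sketch below is only an assumed illustration of that idea (crude emoticon/hashtag/mention counts), not the paper's POS prior-polarity or tree-kernel features.

```python
# Sketch: combine TF-IDF with a few simple Twitter-specific counts via FeatureUnion,
# so these signals survive preprocessing instead of being stripped out.
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

class TwitterCounts(BaseEstimator, TransformerMixin):
    """Counts of emoticons, hashtags and mentions per tweet."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for text in X:
            rows.append([
                len(re.findall(r"[:;=][-~]?[)(DPp]", text)),  # crude emoticon match
                text.count("#"),
                text.count("@"),
            ])
        return np.array(rows)

features = FeatureUnion([("tfidf", TfidfVectorizer()), ("twitter", TwitterCounts())])
model = make_pipeline(features, LinearSVC())
# model.fit(tweets, labels) and evaluate as usual
```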
I am reviewing the code in the `clasiffier` folder. I will update this issue from time to time until I finish reviewing.

1. `espn_data_result.json` on Kavan's side is not reproducible by me. The reason is the code in `classify.py`; I will change this.
2. We will use only the `espn` data to train our classifier for now. I will change `classify.py` and `main.py` to work only on the `espn` data first. After that, in a future commit, I will make the other data work with all the code too.
3. `labelled_tweet.csv`: the filename is not very intuitive, since it only contains tweets without labels. Change it to `tweets.csv`?
4. `label_1.csv`: the filename is not intuitive; do we have a `label_2.csv`? Otherwise, change it to some other name.
5. `label_1.csv`: this will later make our classifier worse. We should use the original values for training.
6. `labelled_tweet.csv`: the content is... not really preprocessed. More can be done. For example, we still have links in our training data, as well as mentions and tags (which could be normalized to normal text), etc. I do not know whether doing so will improve accuracy; see the sketch below.
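A small sketch of the extra preprocessing suggested in point 6. The regexes and the `USER` placeholder are assumptions for illustration, not the repository's current pipeline.

```python
# Strip links, normalize mentions, and keep hashtag text without the '#'.
import re

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # remove links
    tweet = re.sub(r"@\w+", "USER", tweet)               # normalize mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)              # keep hashtag text, drop '#'
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("Great game by @espn! #NBA http://t.co/abc"))
# -> "Great game by USER! NBA"
```

Whether this actually improves accuracy would still need to be checked with the same cross-validation comparison as above.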