Andyccs / sport-news-retrieval


Classification Problems #7

Closed. Andyccs closed this issue 8 years ago.

Andyccs commented 8 years ago

I am reviewing the code in the classifier folder. I will update this issue from time to time until I finish reviewing.

  1. It seems that espn_data_result.json from Kavan's side is not reproducible by me. The reason is the code in classify.py; I will change this.
  2. Overall, we currently use only the ESPN data to train our classifier. I will change classify.py and main.py to work on the ESPN data only for now. In a future commit, I will make the other data sources work with all the code too.
  3. The labelled_tweet.csv filename is not very intuitive, since the file only contains tweets without labels. Should we change it to tweets.csv?
  4. The label_1.csv filename is not intuitive either. Do we have a label_2.csv? If not, we should rename it to something else.
  5. We use a somewhat lazy method to measure kappa. I don't mind that. However, we output the modified labels to label_1.csv, which will later make our classifier worse. We should use the original values for training.
  6. If you look at labelled_tweet.csv, the content is not really pre-processed. Much more could be done: for example, we still have links in our training data, as well as mentions and hashtags (which could be normalized to plain text). I do not know whether doing so will improve accuracy; see the sketch after this list.
  7. Updated 26 March 2016: I did some analysis on the training data. We have 2002 neutral, 625 negative, and 583 positive examples. If I used a trivial classifier that labels every example as neutral, I would get an accuracy of 62.36% (2002 of 3210 tweets). Any classifier we have right now has an accuracy around this value.
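
For point 6, a minimal sketch of the kind of cleanup I mean (the regexes, the example tweet, and the assumption that we clean raw text strings are mine, not what the repository currently does):

```python
import re

def clean_tweet(text):
    """Rough tweet cleanup: drop links, normalize mentions and hashtags to plain words."""
    text = re.sub(r'https?://\S+', '', text)   # remove links
    text = re.sub(r'@(\w+)', r'\1', text)      # @user -> user
    text = re.sub(r'#(\w+)', r'\1', text)      # #tag  -> tag
    return re.sub(r'\s+', ' ', text).strip()   # collapse leftover whitespace

print(clean_tweet("Great win by @TeamA! #football http://t.co/abc"))
# -> "Great win by TeamA! football"
```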
kklw commented 8 years ago
  1. The code to get that file was in formatter.py, which was deleted by @Andyccs in 0d45defaa2e17dfa4383c049e04ca37184fe8f24.
  2. ok
  3. I named it that way to indicate the tweets that are supposed to be labelled by us. We are only required to label 10% of the data, I think. After the previous discussion we decided to use the whole of the ESPN data, so we will need to crawl the remaining 90%.
  4. They are the labels assigned by 'person 1'. We should have had two sets labelled by two people. You can change the file name too.
  5. We should be using the labels that were labelled by us, hence I used the labels there.
  6. Agreed, preprocessing can be improved.
Andyccs commented 8 years ago

Addressing the 1st problem:

We initially planned to use the text-processing API to label all our data, which is why I thought we only needed the label, without the probability, and decided that formatter.py was not required. It turned out later that we only use the text-processing API to label some data as our training data.

I have modified classify.py in cbdac0ed43ab3a04aefefd126e38f11429c76aa4 to output the correct espn_data_result.json that we need. In a later commit, we should use the trained model to label all tweets.
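
For reference, a rough sketch of what that labelling step looks like conceptually (the text-processing.com endpoint is the public API we discussed, but the tweet list, the output fields, and the exact structure of espn_data_result.json here are assumptions, not necessarily what classify.py writes):

```python
import json
import requests

# Hypothetical input: a handful of tweet texts to label as training data.
tweets = ["Great comeback win last night!", "Terrible refereeing today."]

results = []
for text in tweets:
    # The API returns a JSON body like {"label": "pos", "probability": {...}}.
    resp = requests.post("http://text-processing.com/api/sentiment/",
                         data={"text": text})
    body = resp.json()
    results.append({"text": text,
                    "label": body["label"],
                    "probability": body["probability"]})

with open("espn_data_result.json", "w") as f:
    json.dump(results, f, indent=2)
```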

Andyccs commented 8 years ago

Addressing the 4th problem: as discussed offline, we will have label_1.csv and label_2.csv.

Addressing the 5th problem: resolved offline.
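
Once both files exist, a minimal sketch of how the agreement check could be done properly (assuming each CSV holds one annotator's labels in a column named "label", in the same tweet order; the column name is an assumption):

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Assumed layout: one row per tweet, same order in both files,
# with the annotator's label in a "label" column.
labels_1 = pd.read_csv("label_1.csv")["label"]
labels_2 = pd.read_csv("label_2.csv")["label"]

print("Cohen's kappa:", cohen_kappa_score(labels_1, labels_2))
```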

kklw commented 8 years ago

Thanks a lot for the enhancements. Good to note that some tweets become empty strings after preprocessing now.
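
If those empty strings cause trouble downstream, one option is to drop them before training. Just a sketch, assuming a pandas DataFrame with the tweets in a "text" column (the filename and column name are assumptions):

```python
import pandas as pd

df = pd.read_csv("tweets.csv")                   # hypothetical filename from point 3
df["text"] = df["text"].fillna("").str.strip()   # assumes the tweet column is "text"
df = df[df["text"] != ""]                        # drop tweets that became empty
```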

kklw commented 8 years ago

Regarding problem 7:

> 7. Updated 26 March 2016: I did some analysis on the training data. We have 2002 neutral, 625 negative, and 583 positive examples. If I used a trivial classifier that labels every example as neutral, I would get an accuracy of 62.36%. Any classifier we have right now has an accuracy around this value.

We can use the F1-measure to indicate the 'best' performing classifier. I will do a comparison in the report.
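
As a quick illustration of why macro F1 separates the models while accuracy does not (toy labels, not our data): a classifier that never predicts a class gets recall 0, and therefore F1 = 0, for that class.

```python
from sklearn.metrics import classification_report, f1_score

# Toy example: the true labels cover all three classes,
# but the predictions are always "neutral" (a ZeroR-style classifier).
y_true = ["neutral", "neutral", "neutral", "positive", "negative"]
y_pred = ["neutral"] * len(y_true)

print(classification_report(y_true, y_pred))
# "positive" and "negative" are never predicted, so their F1 is 0,
# which drags the macro-averaged F1 far below the 60% accuracy.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```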

Andyccs commented 8 years ago

I am not sure whether we can justify it in this way. If I use a ZeroR classifier (a classifier that simply classifies everything as a single class), I get the following result:

Compared to our tfidf_linear_svc:

Which one do you think is better? Apparently it is ZeroR, right?
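
For reference, a minimal sketch of this kind of comparison (not the exact code in our repo): it assumes tfidf_linear_svc is a TfidfVectorizer + LinearSVC pipeline and uses a hypothetical load_tweets_and_labels() helper in place of the real CSV loading.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_tweets_and_labels()   # hypothetical helper for our CSVs

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

zero_r = DummyClassifier(strategy="most_frequent")               # ZeroR baseline
tfidf_linear_svc = make_pipeline(TfidfVectorizer(), LinearSVC())

for name, model in [("ZeroR", zero_r), ("tfidf_linear_svc", tfidf_linear_svc)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "macro F1:", f1_score(y_test, pred, average="macro"))
```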

Andyccs commented 8 years ago

The best way to solve this problem is to use a better classifier or better feature extraction methods.
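
One cheap thing to try first, sketched under the same TfidfVectorizer + LinearSVC assumption as above (this is a suggestion, not something already in classify.py): compensate for the class imbalance inside the classifier with class weighting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# class_weight="balanced" reweights each class by its inverse frequency,
# so the 2002/625/583 imbalance does not push the model towards "neutral".
model = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
```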

kklw commented 8 years ago

I did a literature review of the research paper Agarwal, Apoorv et al., "Sentiment analysis of Twitter data," Proceedings of the Workshop on Languages in Social Media, 23 Jun. 2011: 30-38, where the authors report state-of-the-art results. It was interesting to note that they found Twitter-specific features (emoticons, hashtags, etc.) add value to the classifier, but only marginally. In our case, we simply remove most of these features.

The main differences between our methods are in the preprocessing and feature engineering steps. They propose POS-specific prior polarity features and a tree kernel for feature engineering, while we used tf-idf. On reflection, tf-idf might not have been a good choice for this context, as it is probably better suited to larger documents. Their classifier setup, a support vector machine (SVM) with 5-fold cross-validation, is similar to what we tried.

Their best result for binary classification looks promising (accuracy = 75.39, F1-pos = 74.81, F1-neg = 75.86), while their best 3-way classification result is poorer (accuracy = 60.50, F1-pos = 59.41, F1-neu = 60.15, F1-neg = 61.86). I listed the key steps to reproduce the results in our Google Drive folder.
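
To keep our numbers comparable with the paper's setup, we could also evaluate with 5-fold cross-validation rather than a single split. A rough sketch, again assuming a hypothetical load_tweets_and_labels() helper and a TfidfVectorizer + LinearSVC pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_tweets_and_labels()   # hypothetical helper, as above

model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_validate(model, texts, labels, cv=5,
                        scoring=["accuracy", "f1_macro"])

print("accuracy:", scores["test_accuracy"].mean())
print("macro F1:", scores["test_f1_macro"].mean())
```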