BonShillings opened 9 years ago
Got 50% using BayesNet just now. Used all features, and then StringToWordVector on the string itself. Finally some improvement.
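For reference, a minimal sketch of what that setup might look like with Weka's Java API. The file name `tweets.arff` and the class-attribute-last layout are assumptions, not what's actually in the repo:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BayesNetBow {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset: the hand-built features plus one raw string
        // attribute for the tweet text, with the class attribute last.
        Instances data = new Instances(new BufferedReader(new FileReader("tweets.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // StringToWordVector expands the string attribute into word features
        // and leaves the other (numeric) attributes alone.
        StringToWordVector stwv = new StringToWordVector();
        stwv.setLowerCaseTokens(true);
        stwv.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, stwv);

        // 10-fold cross-validation with BayesNet.
        Evaluation eval = new Evaluation(vectorized);
        eval.crossValidateModel(new BayesNet(), vectorized, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```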
I'm very close to getting the bag-of-words implementation working. It's just a matter of cleaning up tokens so they don't make Weka cry.
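In case it's useful, here's the kind of cleanup I mean, as a rough sketch. The helper name and the character whitelist are made up; `Utils.quote` is Weka's own ARFF escaping:

```java
import weka.core.Utils;

public class TokenCleaner {
    // Hypothetical cleanup: strip line breaks and symbols that tend to upset
    // the ARFF parser, collapse whitespace, then let Weka do the quoting.
    public static String clean(String raw) {
        String s = raw.replaceAll("[\\r\\n\\t]", " ")           // no embedded line breaks
                      .replaceAll("[^\\p{L}\\p{N}\\s@#']", " ") // keep letters, digits, @, #, '
                      .replaceAll("\\s+", " ")
                      .trim();
        return Utils.quote(s); // escapes quotes and other ARFF special characters
    }

    public static void main(String[] args) {
        System.out.println(clean("so happy!!! :) \"quoted\",\nnew line"));
    }
}
```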
Ok, just pushed the bag-of-words implementation. Try it out and see whether it does better or worse.
Hey, the BOW features were added in exactly the right format.
There is an issue with the size of the feature space, however. I think some terms need to be trimmed, and some Zipf distribution theory might help us: we can trim the words that occur very infrequently (i.e. words that appear in <= 3 documents, or something like that).
*I implemented this with the last push and was able to get 53% precision using Complement Naive Bayes.
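A sketch of the pruning idea, assuming it's done through StringToWordVector (its minTermFreq option prunes by term frequency rather than document frequency, so it only approximates the "appears in <= 3 documents" cut). ComplementNaiveBayes wants word counts rather than 0/1 presence, hence outputWordCounts. File name is hypothetical:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.ComplementNaiveBayes;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PrunedBowCnb {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("tweets.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        StringToWordVector stwv = new StringToWordVector();
        stwv.setOutputWordCounts(true); // counts, not binary presence
        stwv.setMinTermFreq(4);         // drop terms occurring fewer than 4 times
        stwv.setLowerCaseTokens(true);
        stwv.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, stwv);

        Evaluation eval = new Evaluation(vectorized);
        eval.crossValidateModel(new ComplementNaiveBayes(), vectorized, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("precision(class 0) = " + eval.precision(0));
    }
}
```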
So with the new features like q_marks, e_marks, pos_score, etc., I still haven't been able to outperform the baseline performance of 45%, which was achieved using J48 and the StringToWordVector filter. Using q_marks etc. on their own matches this performance, and even combinations of STWV and the above features achieve only around the baseline.
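One way to make sure the hand-built features actually reach the classifier alongside the word features is a FilteredClassifier, which applies StringToWordVector only to the string column and leaves the numeric q_marks/e_marks/pos_score columns in place. Sketch only; the file name is assumed:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CombinedFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("tweets.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // The filter runs inside each cross-validation fold, which also avoids
        // leaking the test fold's vocabulary into training.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.printf("%.1f%% correct%n", eval.pctCorrect());
    }
}
```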
I'm really not sure how to increase the quality of the results over the baseline. Hopefully new features will help, but maybe it would be best to look into other Weka algorithms... I've tried SMO (SVM), J48 (decision tree), and Naive Bayes, and they have similar performance.
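If we do end up shopping around for algorithms, a small harness like this makes the comparison less tedious (sketch only; file name assumed, dataset assumed already vectorized):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("tweets_vectorized.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Same 10-fold split (fixed seed) for every classifier.
        Classifier[] models = { new SMO(), new J48(), new NaiveBayes() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-12s %.1f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```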