DanielRapp opened this issue 12 years ago
Wikipedia will not work either. You need to take sentences from Twitter that are not followed by TWSS. Positive and negative examples have to be at least a little bit similar; otherwise you are creating a classifier that merely distinguishes sentences from Twitter from sentences from Wikipedia.
Sentences that were and weren't replied to with "twss" or "that's what she said" on Twitter have been collected. But the positive sentences are way too noisy to be usable.
Right now the sentences that can be replied with "that's what she said" all come from http://twssstories.com/
How about creating a classifier that distinguishes between FML (http://fmylife.com) and TWSS (http://twssstories.com/) stories? Both sources have well-curated data if you use only those examples above a certain "like" or "I agree, your life sucks" + "you deserved it" threshold. Or you could simply use the best 2000 examples from each website.
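The curation step above might be sketched like this, assuming each scraped post is a hypothetical `(text, likes)` tuple (the field names and the scraping itself are not specified in this thread):

```python
def top_posts(posts, n=2000):
    # Keep only the n most highly rated stories from a site,
    # sorting by like count in descending order.
    ranked = sorted(posts, key=lambda p: p[1], reverse=True)
    return [text for text, likes in ranked[:n]]

# Tiny usage example with made-up data:
posts = [("story a", 1), ("story b", 5), ("story c", 3)]
best = top_posts(posts, n=2)  # ["story b", "story c"]
```

The same function would be run once per site (FML and TWSS) to build the two training classes.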
Normalize all vectors to unit length [ x := x/sqrt(sum(x*x)) ]. This way, for any two sentences the scalar product will be between 0 and 1 (assuming non-negative feature vectors, such as word counts).
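A minimal sketch of that normalization, using NumPy (the feature vectors here are made-up examples):

```python
import numpy as np

def normalize(x):
    # Scale a vector to unit length: x := x / sqrt(sum(x*x))
    norm = np.sqrt(np.sum(x * x))
    return x / norm if norm > 0 else x

# For non-negative feature vectors (e.g. word counts), the dot
# product of two unit vectors is a cosine similarity in [0, 1].
a = normalize(np.array([3.0, 4.0]))
b = normalize(np.array([1.0, 0.0]))
similarity = float(np.dot(a, b))  # 0.6 for these two vectors
```

After this step, "nearest neighbor" simply means "largest dot product".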
For any new sentence, first check its 20 nearest neighbors. If the number of FML neighbors is similar to the number of TWSS neighbors (FML ~ TWSS), it is neither FML nor TWSS; but if FML >> TWSS, label it FML, and if TWSS >> FML, label it TWSS.
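The voting rule above might look like the following sketch. The comment does not say how much FML must outnumber TWSS (or vice versa) before committing, so the `margin` factor here is a hypothetical choice; vectors are assumed unit-normalized so the dot product is the cosine similarity:

```python
import numpy as np

def knn_label(query, vectors, labels, k=20, margin=2.0):
    # Rank training vectors by dot product with the (unit-length)
    # query; with unit vectors this is cosine similarity.
    sims = vectors @ query
    nearest = np.argsort(sims)[-k:]
    votes_fml = sum(1 for i in nearest if labels[i] == "FML")
    votes_twss = k - votes_fml
    # Hypothetical decision rule: one class must outnumber the
    # other by a factor of `margin`, otherwise abstain.
    if votes_fml >= margin * max(votes_twss, 1):
        return "FML"
    if votes_twss >= margin * max(votes_fml, 1):
        return "TWSS"
    return "neither"

# Toy 2-D data: an FML cluster near [1, 0], a TWSS cluster near [0, 1].
pts = np.array([[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [1.0, 0.0], [0.8, 0.1],
                [0.1, 1.0], [0.2, 1.0], [0.1, 0.9], [0.0, 1.0], [0.1, 0.8]])
vecs = pts / np.linalg.norm(pts, axis=1, keepdims=True)
lbls = ["FML"] * 5 + ["TWSS"] * 5

q = np.array([1.0, 0.05]); q = q / np.linalg.norm(q)
label_a = knn_label(q, vecs, lbls, k=5)      # clearly in the FML cluster
d = np.array([1.0, 1.0]); d = d / np.linalg.norm(d)
label_b = knn_label(d, vecs, lbls, k=5)      # equidistant -> abstain
```

Real sentence vectors would of course be high-dimensional bag-of-words counts rather than 2-D points, but the decision logic is the same.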
The data will be noisy by nature. If you get accuracy > 0.6, you can stop optimizing your algorithm.
The data for normal sentences (collected from fmylife.com) is biased toward the word "was" (and probably a lot of other things). A good resource for new normal sentences might be Wikipedia.