DanielRapp / twss.js

A node.js "that's what she said" classifier
MIT License
694 stars 34 forks

Normal sentence data is biased #6

Open DanielRapp opened 12 years ago

DanielRapp commented 12 years ago

The data for normal sentences (collected from fmylife.com) is biased towards the word "was" (and probably a lot of other things). A good resource for new normal sentences may be Wikipedia.

entaroadun commented 12 years ago

Wikipedia will not work either. You need to take sentences from Twitter that are not followed by TWSS. Positive and negative examples have to be at least a little bit similar; otherwise you are creating a classifier that will distinguish between sentences from Twitter and Wikipedia.

DanielRapp commented 12 years ago

Sentences that were and weren't replied to with "twss" or "that's what she said" on Twitter have been collected. But the positive sentences are way too noisy to be usable.

Right now the sentences that can be replied with "that's what she said" all come from http://twssstories.com/

entaroadun commented 12 years ago

How about creating a classifier that distinguishes between FML (http://fmylife.com) and TWSS (http://twssstories.com/) stories? Both sources have well-curated data if you use only those examples above a certain "like" or "I agree, your life sucks" + "you deserved it" threshold. Or you could simply use the best 2000 examples from each website.
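A rough sketch of that selection step, assuming each scraped story has already been turned into an object with a `text` and a `votes` field. That record shape, the `selectBest` name, and the sample stories are hypothetical, not something either site or twss.js provides:

```javascript
// Keep stories that meet a vote threshold, then take the highest-voted ones.
// `minVotes` and `limit` are illustrative parameters; 2000 mirrors the
// "best 2000 examples" suggestion above.
function selectBest(stories, { minVotes = 0, limit = 2000 } = {}) {
  return stories
    .filter((story) => story.votes >= minVotes) // filter() returns a new array
    .sort((x, y) => y.votes - x.votes)          // so sorting does not mutate the input
    .slice(0, limit);
}

// Placeholder records, only to show the call.
const sampleStories = [
  { text: "story A", votes: 120 },
  { text: "story B", votes: 15 },
  { text: "story C", votes: 480 },
];

const best = selectBest(sampleStories, { minVotes: 100, limit: 2 });
console.log(best.map((story) => story.text));
```

Running the same function over both the FML and the TWSS stories with the same `limit` would also keep the two classes the same size.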

Normalize all vectors to unit length [ x := x/sqrt(sum(x*x)) ]. This way, for any two sentences the scalar product will be between 0 and 1 (assuming the vector components are non-negative, as with word counts).
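As a minimal sketch of the normalization and the scalar product, assuming each sentence is represented as a plain array of non-negative word counts over a shared vocabulary (the `normalize` and `dot` helpers and the example vectors are made up for illustration, not taken from twss.js):

```javascript
// Scale a vector to unit length: x := x / sqrt(sum(x * x)).
function normalize(vec) {
  const norm = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  // An all-zero vector has no direction; return a copy to avoid dividing by zero.
  if (norm === 0) return vec.slice();
  return vec.map((v) => v / norm);
}

// Scalar (dot) product of two equal-length vectors.
function dot(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Hypothetical word-count vectors over a four-word vocabulary.
const a = normalize([1, 2, 0, 1]);
const b = normalize([0, 1, 1, 1]);

console.log(dot(a, b)); // similarity of the two sentences, between 0 and 1
console.log(dot(a, a)); // approximately 1 for a sentence compared with itself
```

With unit-length vectors the scalar product is the cosine similarity, so a larger value means a closer neighbor.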

For any new sentence, you check its 20 nearest neighbors. If the numbers of FML and TWSS neighbors are similar (FML ~ TWSS), then it is neither FML nor TWSS; if FML >> TWSS, you label it FML, and if TWSS >> FML, you label it TWSS.
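The neighbor vote could look something like the sketch below. It assumes all vectors are already normalized to unit length, and it interprets ">>" as "at least `ratio` times as many", which is my own assumption since no cutoff is given here. The `classify` name, the `ratio` parameter, and the toy two-dimensional vectors are hypothetical:

```javascript
// Scalar product of two equal-length, unit-length vectors.
function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// examples: array of { vector: number[], label: "FML" | "TWSS" }.
// k is the number of neighbors to inspect (20 in the suggestion above).
// ratio is an assumed cutoff for ">>": one class must have at least
// `ratio` times as many neighbors as the other.
function classify(query, examples, k = 20, ratio = 2) {
  const neighbors = examples
    .map((ex) => ({ label: ex.label, similarity: dot(query, ex.vector) }))
    .sort((x, y) => y.similarity - x.similarity) // most similar first
    .slice(0, k);

  const fml = neighbors.filter((n) => n.label === "FML").length;
  const twss = neighbors.length - fml;

  if (fml > 0 && fml >= ratio * twss) return "FML";
  if (twss > 0 && twss >= ratio * fml) return "TWSS";
  return "neither"; // counts are too close to call
}

// Toy unit-length vectors, only to show the call.
const examples = [
  { vector: [1, 0], label: "FML" },
  { vector: [0.8, 0.6], label: "FML" },
  { vector: [0.96, 0.28], label: "FML" },
  { vector: [0, 1], label: "TWSS" },
  { vector: [0.6, 0.8], label: "TWSS" },
  { vector: [0.28, 0.96], label: "TWSS" },
];

console.log(classify([1, 0], examples, 3)); // "FML"
console.log(classify([0, 1], examples, 3)); // "TWSS"
console.log(classify([Math.SQRT1_2, Math.SQRT1_2], examples, 6)); // "neither"
```

Sorting every example is fine for a few thousand stories; a larger data set would probably want a proper nearest-neighbor index instead.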

The data will be noisy by its nature. If you get accuracy > 0.6 you can stop optimizing your algorithm.