brendano / ark-tweet-nlp

CMU ARK Twitter Part-of-Speech Tagger
http://www.ark.cs.cmu.edu/TweetNLP/

emoticon confused with RT #19

Open chenhaot opened 11 years ago

chenhaot commented 11 years ago

For instance, pls RTTell will be tokenized as pls R TT ell.

I have an ad-hoc fix for now. It seems OK to me.

brendano commented 11 years ago

Hm. Do you have other examples? Is it always with the double underscore?

Please send us a pull request with your fix if you can. To test a fix to the tokenizer, we run the old and new versions on 100,000 tweets, then look at the differences, if any.
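The old-versus-new comparison described above could be sketched roughly as follows. This is only an illustration in Python, not the project's actual test harness: `diff_tokenizers`, `old_tokenize`, and `new_tokenize` are hypothetical names, and the two toy tokenizers are stand-ins for the real old and new tokenizer builds.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def diff_tokenizers(
    tweets: Iterable[str], old: Tokenizer, new: Tokenizer
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (tweet, old_tokens, new_tokens) for each tweet whose output changed."""
    diffs = []
    for tweet in tweets:
        old_tokens = old(tweet)
        new_tokens = new(tweet)
        if old_tokens != new_tokens:
            diffs.append((tweet, old_tokens, new_tokens))
    return diffs


# Hypothetical stand-ins; in practice these would wrap the old and new tokenizer.
def old_tokenize(text: str) -> List[str]:
    return text.split()


def new_tokenize(text: str) -> List[str]:
    # Toy change: split commas off as separate tokens.
    return text.replace(",", " , ").split()


# Assumed sample input; a real run would stream ~100,000 tweets from a file.
sample = ["pls RT this", "hello, world"]

for tweet, before, after in diff_tokenizers(sample, old_tokenize, new_tokenize):
    print(tweet)
    print("  old:", " ".join(before))
    print("  new:", " ".join(after))
```

Printing only the tweets whose tokenization changed keeps the output small enough to review by hand, which is the point of running it over a large sample.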