brendano / ark-tweet-nlp

CMU ARK Twitter Part-of-Speech Tagger
http://www.ark.cs.cmu.edu/TweetNLP/

Use twitter-text to extract hashtags, mentions, and URLs #44

Open jrnold opened 7 years ago

jrnold commented 7 years ago

Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for getting URLs that aren't preceded by a protocol. Offloading the identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or at the very least, mean it's making the same mistakes as Twitter itself).

brendano commented 7 years ago

It would be great to see a diff of tokenization under twokenize's current rules, versus what it is when using twitter-text's rules.
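
A rough sketch of how such a diff could be produced, using Python's standard `difflib`. Neither tokenizer below is twokenize or twitter-text: `tokenize_whitespace` and `tokenize_entities` are hypothetical stand-ins (and `ENTITY_OR_WORD` is a deliberately simplified pattern) so the example runs on its own. In practice you would swap in wrappers around the two real tokenizers.

```python
import difflib
import re


def tokenize_whitespace(text):
    """Hypothetical stand-in for the 'current rules' tokenizer."""
    return text.split()


# Simplified, illustrative pattern: URLs, then hashtags/mentions,
# then words, then single punctuation characters.
ENTITY_OR_WORD = re.compile(r"https?://\S+|[#@]\w+|\w+|[^\w\s]")


def tokenize_entities(text):
    """Hypothetical stand-in for an entity-aware tokenizer."""
    return ENTITY_OR_WORD.findall(text)


def token_diff(text, tokenizer_a, tokenizer_b):
    """Return the spans where the two tokenizations disagree.

    Each item is (tag, tokens_from_a, tokens_from_b), where tag is one of
    difflib's opcode tags: 'replace', 'delete', or 'insert'.
    """
    a = tokenizer_a(text)
    b = tokenizer_b(text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    tweets = ["check #nlp, @bob!", "see http://t.co/abc"]
    for tweet in tweets:
        # Print only the tweets on which the two tokenizers disagree.
        differences = token_diff(tweet, tokenize_whitespace, tokenize_entities)
        if differences:
            print(tweet)
            for tag, old, new in differences:
                print(f"  {tag}: {old} -> {new}")
```

Running something like this over a sample of tweets and looking only at the non-empty diffs would show where the two rule sets disagree on hashtags, mentions, and URLs.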