clulab / twitter4food

Repository for the health informatics analytics on twitter project
Apache License 2.0

Improve tokenization #4

Closed herongrove closed 7 years ago

herongrove commented 7 years ago

Ark Tweet does a good job, but tokenization remains poor for:

  1. emoji, e.g. commentary😂😂😂😂😂omg😭😭😭 is a single token.
  2. some URLs, which get broken apart and leave stray https tokens.
  3. words separated by / and similar characters.

When tokenizing, do an additional manual check for these cases (and brainstorm others) and split/lump/delete as necessary.
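
A rough sketch of what such a post-tokenization pass could look like, written in Python purely for illustration. It is not the project's code and makes no claim about Ark Tweet's API. The function name `refine`, the emoji code point ranges, and the choice to drop bare `http`/`https` tokens are all assumptions. The ranges in particular are not exhaustive: variation selectors, ZWJ sequences, and skin-tone modifiers are not handled specially.

```python
import re

# Assumption: a rough, non-exhaustive set of emoji code point ranges.
# The capturing group makes re.split keep each emoji as its own piece.
EMOJI = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")

# Tokens that are already complete URLs are left untouched.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Assumption: a bare scheme token is debris from a badly split URL.
URL_FRAGMENTS = {"http", "https"}

# Two or more purely alphabetic words joined by "/" (so "1/2" is not split).
SLASH_WORDS = re.compile(r"[^\W\d_]+(?:/[^\W\d_]+)+")


def refine(tokens):
    """Split, keep, or delete tokens produced by an upstream tokenizer."""
    out = []
    for tok in tokens:
        if URL.fullmatch(tok):
            out.append(tok)                      # intact URL: keep as is
        elif tok.lower() in URL_FRAGMENTS:
            continue                             # stray scheme token: delete
        else:
            for piece in EMOJI.split(tok):       # separate emoji from words
                if not piece:
                    continue
                if SLASH_WORDS.fullmatch(piece):
                    out.extend(piece.split("/"))  # "and/or" -> "and", "or"
                else:
                    out.append(piece)
    return out
```

Under these assumptions, `refine` would be applied to the token list for each tweet after the main tokenizer has run. Whether stray `https` tokens should be deleted or lumped back into a single URL token is a design choice the issue leaves open; the sketch only shows the delete option.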