Preprocessing: better tokenization

MediaUncovered / NewsAnalysis

use word embeddings to uncover bias in newspapers


Preprocessing: better tokenization #11

Tilana closed 7 years ago

Tilana commented 7 years ago

With the split() method, punctuation is not separated from words, e.g. ['this', 'is', 'an', 'example!']. Check out the nltk library for better tokenization.

Tilana commented 7 years ago

Use word_tokenize from nltk.
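
For illustration, here is a minimal sketch comparing the two approaches on the example sentence from this issue. It is not the project's actual code: the `tokenize` helper is a hypothetical name, and the regex fallback is an assumption added only so the snippet still runs when nltk or its tokenizer data is not installed.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens.

    Hypothetical helper: prefers nltk's word_tokenize and falls back to a
    simple regex if nltk or its tokenizer data is unavailable.
    """
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # Fallback (an assumption, not nltk behaviour): runs of word
        # characters, or any single non-space punctuation character.
        return re.findall(r"\w+|[^\w\s]", text)


sentence = "this is an example!"

# str.split() only splits on whitespace, so '!' stays glued to the word.
split_tokens = sentence.split()
print(split_tokens)   # ['this', 'is', 'an', 'example!']

# The tokenizer separates the trailing punctuation into its own token.
better_tokens = tokenize(sentence)
print(better_tokens)  # ['this', 'is', 'an', 'example', '!']
```

Note that word_tokenize needs nltk's tokenizer data to be downloaded first; without it, nltk raises a LookupError, which the sketch above catches.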