Closed dimabear123 closed 2 years ago
Just had everything running on Colab; it still works fine. Note that remove_stop
already expects a list of tokens as its parameter, not a string. So you should call tokenize
first and then feed its output into remove_stop.
That's what happens later in the pipeline with the following definition:
pipeline = [str.lower, tokenize, remove_stop]
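To illustrate how the pipeline chains these steps, here is a minimal sketch. The `tokenize` implementation, the `stopwords` set, and the `prepare` driver below are simplified stand-ins for illustration; the book's actual helpers may differ.

```python
import re

# Illustrative stand-ins -- the book's real tokenize/stopwords may differ.
stopwords = {"the", "a", "is"}

def tokenize(text):
    # Naive word tokenizer: split on word characters.
    return re.findall(r"\w+", text)

def remove_stop(tokens):
    # Expects a LIST of tokens, not a raw string.
    return [t for t in tokens if t.lower() not in stopwords]

pipeline = [str.lower, tokenize, remove_stop]

def prepare(text, pipeline):
    # Apply each step in order; tokenize runs BEFORE remove_stop,
    # so remove_stop always receives a list of tokens.
    for fn in pipeline:
        text = fn(text)
    return text

print(prepare("The cat is on a mat", pipeline))
# -> ['cat', 'on', 'mat']
```

Because `tokenize` runs before `remove_stop` in the pipeline, there is never a point where `remove_stop` sees a raw string, so no `.split()` is needed inside it.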
Thank you!
I noticed I do not get the same results as the book or GitHub unless I add a .split() in the remove_stop function.
Repo and Book:
def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]
However, that ends up filtering individual characters rather than whole words. Here's what worked for me:
def remove_stop(tokens):
    return [t for t in tokens.split() if t.lower() not in stopwords]
My tokenize and prepare functions are identical to yours. Only the remove_stop function needed changing, by adding the .split().
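The character-by-character behavior happens because iterating over a Python string yields its individual characters. A small sketch of both cases (the `stopwords` set here is a tiny illustrative one, not the book's):

```python
stopwords = {"a", "i"}  # tiny illustrative stopword set

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

# Passing a raw string iterates over characters, so any single
# character that happens to be a stopword disappears:
print(remove_stop("data is great"))
# -> ['d', 't', ' ', 's', ' ', 'g', 'r', 'e', 't']

# Passing a token list (e.g. via .split() or a tokenizer) behaves as intended:
print(remove_stop("data is great".split()))
# -> ['data', 'is', 'great']
```

This is why the function appeared broken when fed a string directly: adding `.split()` inside remove_stop papers over it, while the pipeline's intended fix is to tokenize before calling remove_stop.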