blueprints-for-text-analytics-python / blueprints-text

Jupyter notebooks for our O'Reilly book "Blueprints for Text Analytics Using Python"
Apache License 2.0

Ch 1 - remove stopwords in Pipeline -- need to add split()? #21

Closed · dimabear123 closed this issue 2 years ago

dimabear123 commented 2 years ago

I noticed I do not get the same results as the book or the GitHub repo unless I add a .split() in the remove_stop function.

Repo and book:

```python
def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]
```

However, that ends up filtering individual characters, not whole words. Here's what worked for me:

```python
def remove_stop(tokens):
    return [t for t in tokens.split() if t.lower() not in stopwords]
```

My tokenize and prepare functions are identical to yours. Only the remove_stop function needed to change, by adding the .split().
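
To see why the unmodified function appears to split characters: iterating over a string yields its characters, while iterating over a list yields its elements. A minimal sketch of the difference, using a tiny hypothetical stopword set:

```python
# Hypothetical stopword set, for illustration only.
stopwords = {"the", "a"}

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

# Passing a raw string: iteration yields single characters.
print(remove_stop("the cat"))          # ['t', 'h', 'e', ' ', 'c', 't']

# Passing a token list: iteration yields whole words.
print(remove_stop("the cat".split()))  # ['cat']
```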

jsalbr commented 2 years ago

Just ran everything on Colab again; it still works fine. Note that remove_stop already expects a list of tokens as its parameter, not a string. So you should call tokenize first and then feed its output into remove_stop. That's what happens later in the pipeline with the following definition:

```python
pipeline = [str.lower, tokenize, remove_stop]
```
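
Putting the pieces together, here is a minimal, self-contained sketch of how that pipeline is applied; the tokenizer, stopword set, and prepare helper below are simplified stand-ins for the Chapter 1 definitions, not the book's exact code:

```python
import re

# Simplified stand-ins for the Chapter 1 definitions.
stopwords = {"the", "a", "of"}

def tokenize(text):
    return re.findall(r"[\w-]+", text)

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

def prepare(text, pipeline):
    # Apply each transform in order: tokenize turns the string into a
    # list, so remove_stop receives tokens, never raw characters.
    tokens = text
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens

pipeline = [str.lower, tokenize, remove_stop]
print(prepare("The Art of Natural Language Processing", pipeline))
# -> ['art', 'natural', 'language', 'processing']
```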
dimabear123 commented 2 years ago

Thank you!