Alir3z4 / stop-words

List of common stop words in various languages.
http://alir3z4.github.io/stop-words
Creative Commons Attribution 4.0 International

English Stop Words have additional character. #11

Closed JeMunos closed 7 years ago

JeMunos commented 7 years ago

Somehow the English stop word list is printed with the letter u in front of each word.

from stop_words import get_stop_words

en_stop = get_stop_words('english')
print(en_stop)

[u'a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and', u'any', u'are', u"aren't", u'as', u'at', u'be', u'because', u'been', u'before', u'being', u'below', u'between', u'both', u'but', u'by', u"can't", u'cannot', u'could', u"couldn't", u'did', u"didn't", u'do', u'does', u"doesn't", u'doing', u"don't", u'down', u'during', u'each', u'few', u'for', u'from', u'further', u'had', u"hadn't", u'has', u"hasn't", u'have', u"haven't", u'having', u'he', u"he'd", u"he'll", u"he's", u'her', u'here', u"here's", u'hers', u'herself', u'him', u'himself', u'his', u'how', u"how's", u'i', u"i'd", u"i'll", u"i'm", u"i've", u'if', u'in', u'into', u'is', u"isn't", u'it', u"it's", u'its', u'itself', u"let's", u'me', u'more', u'most', u"mustn't", u'my', u'myself', u'no', u'nor', u'not', u'of', u'off', u'on', u'once', u'only', u'or', u'other', u'ought', u'our', u'ours', u'ourselves', u'out', u'over', u'own', u'same', u"shan't", u'she', u"she'd", u"she'll", u"she's", u'should', u"shouldn't", u'so', u'some', u'such', u'than', u'that', u"that's", u'the', u'their', u'theirs', u'them', u'themselves', u'then', u'there', u"there's", u'these', u'they', u"they'd", u"they'll", u"they're", u"they've", u'this', u'those', u'through', u'to', u'too', u'under', u'until', u'up', u'very', u'was', u"wasn't", u'we', u"we'd", u"we'll", u"we're", u"we've", u'were', u"weren't", u'what', u"what's", u'when', u"when's", u'where', u"where's", u'which', u'while', u'who', u"who's", u'whom', u'why', u"why's", u'with', u"won't", u'would', u"wouldn't", u'you', u"you'd", u"you'll", u"you're", u"you've", u'your', u'yours', u'yourself', u'yourselves']

JeMunos commented 7 years ago

$ pip show stop_words
Name: stop-words
Version: 2015.2.23.1
Summary: Get list of common stop words in various languages in Python
Home-page: https://github.com/Alir3z4/python-stop-words
Author: Alireza Savand
Author-email: alireza.savand@gmail.com
License: Copyright (c) 2014, Alireza Savand, Contributors
Location: /usr/local/lib/python2.7/site-packages

Alir3z4 commented 7 years ago

@JeMunos In Python 2, Unicode strings are displayed with a u prefix. The u is not part of the string itself; it is just the literal syntax that tells Python the string is a Unicode string.

In Python 3, however, all strings are Unicode by default, so the prefix is not shown and this kind of confusion doesn't arise.
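As a rough illustration of the point about the prefix, here is a minimal sketch that runs on both Python 2.7 and Python 3. The `sample` list is a hypothetical stand-in for the first few entries returned by `get_stop_words('english')`, so the snippet does not need the package installed.

```python
# Hypothetical sample standing in for the output of get_stop_words('english').
sample = [u'a', u'about', u"aren't"]

# Printing a list shows the repr() of each element.
# On Python 2 the repr of a Unicode string includes the u prefix;
# on Python 3 it does not.
print(sample)

# The prefix is not stored in the string: lengths and first characters
# reflect only the actual text.
lengths = [len(word) for word in sample]
first_chars = [word[0] for word in sample]
print(lengths)  # [1, 5, 6]

# Joining the words into one string and printing that string shows the
# words themselves, with no prefix, on either Python version.
joined = u', '.join(sample)
print(joined)  # a, about, aren't
```

If the goal is only a cleaner display on Python 2, printing a joined string as in the last step is usually enough; the list contents do not need to be changed.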

Unicode in Python 2: https://docs.python.org/2/howto/unicode.html
Unicode in Python 3: https://docs.python.org/3/howto/unicode.html

Does that make sense? I'll close this issue; feel free to re-open it if that doesn't answer your question.

JeMunos commented 7 years ago

OK, that makes sense. Thank you for the clarification.