chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Add additional flags to doc_extensions.to_bag_of_words #249

Closed kjoshi closed 5 years ago

kjoshi commented 5 years ago

Description

doc_extensions.to_bag_of_words has been modified to include additional flags that let the user choose whether to exclude stop words, punctuation, and/or spaces from the word counts.
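As a rough illustration of how such flags might behave, here is a minimal pure-Python sketch. It does not use textacy or spaCy: `Tok` is a hypothetical stand-in for a spaCy token, `bag_of_words` is a hypothetical helper, and the name `remove_space` is an assumption (only `remove_stop` and `remove_punct` appear in the test session below).

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical stand-in for a spaCy token; it models only the three
# attributes that the flags consult.
@dataclass(frozen=True)
class Tok:
    text: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


def bag_of_words(tokens, remove_stop=True, remove_punct=True, remove_space=True):
    """Count token texts, skipping every token class whose flag is True.

    `remove_space` is an assumed flag name, used here for illustration only.
    """
    counts = Counter()
    for tok in tokens:
        if remove_stop and tok.is_stop:
            continue
        if remove_punct and tok.is_punct:
            continue
        if remove_space and tok.is_space:
            continue
        counts[tok.text] += 1
    return dict(counts)


tokens = [
    Tok("This", is_stop=True),
    Tok("is", is_stop=True),
    Tok("a", is_stop=True),
    Tok("test"),
    Tok(".", is_punct=True),
]

default_bow = bag_of_words(tokens)
with_stops = bag_of_words(tokens, remove_stop=False)
with_stops_and_punct = bag_of_words(tokens, remove_stop=False, remove_punct=False)
```

With every flag left at its default, only `test` is counted; switching a flag off lets the corresponding token class through.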

Corpus.word_counts and Corpus.word_doc_counts have also been updated to pass the relevant flags through to doc_extensions.to_bag_of_words.
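The pass-through pattern can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not textacy's actual implementation: `doc_bow`, `word_counts`, and `word_doc_counts` are hypothetical free functions, documents are plain lists of strings, and `STOP` is a toy stop list.

```python
from collections import Counter

# Toy stop list, for illustration only (not spaCy's).
STOP = {"this", "is", "a"}


def doc_bow(words, remove_stop=True):
    """Per-document bag of words over a plain list of strings."""
    return Counter(w for w in words if not (remove_stop and w.lower() in STOP))


def word_counts(docs, **flags):
    """Total occurrences of each word; flags are forwarded unchanged."""
    total = Counter()
    for words in docs:
        total.update(doc_bow(words, **flags))
    return dict(total)


def word_doc_counts(docs, **flags):
    """Number of documents containing each word; same flag forwarding."""
    seen = Counter()
    for words in docs:
        # Iterating a Counter yields each distinct word once.
        seen.update(doc_bow(words, **flags).keys())
    return dict(seen)


docs = [["This", "is", "a", "test"], ["a", "test", "test"]]

totals = word_counts(docs)
doc_totals = word_doc_counts(docs)
totals_with_stops = word_counts(docs, remove_stop=False)
```

Forwarding the flags as keyword arguments keeps the corpus-level functions in step with the per-document function when new flags are added.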

Motivation and Context

Some analyses need to keep track of stop words, punctuation, and/or spaces when converting a doc to a bag of words.

How Has This Been Tested?

> d = "This is a test. This, here, is another test"
> doc = textacy.make_spacy_doc(d)

> textacy.spacier.doc_extensions.to_bag_of_words(doc, normalize="", as_strings=True)
{'test': 2}

> textacy.spacier.doc_extensions.to_bag_of_words(doc, normalize="", as_strings=True, remove_stop=False)
{'is': 2, 'This': 2, 'a': 1, 'here': 1, 'another': 1, 'test': 2}

> textacy.spacier.doc_extensions.to_bag_of_words(doc, normalize="", as_strings=True, remove_stop=False, remove_punct=False)
{'!': 1, 'is': 2, 'This': 2, '.': 1, 'a': 1, 'here': 1, 'another': 1, 'test': 2, ',': 2}


kjoshi commented 5 years ago

Hi @bdewilde, thanks for the comments and suggestions. I've rebased this pull request onto the develop branch and made the other changes you suggested.

Let me know if there's anything else you'd like me to tweak.