chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

to_bag_of_terms unexpected behavior #311

Closed: nealonhager closed this 3 years ago

nealonhager commented 3 years ago

steps to reproduce

  1. initialize doc with text containing the word "didn't"
  2. doc._.to_bag_of_terms(ngrams=(1, 2, 3), weighting="count", normalize=None, as_strings=True, filter_stops=False, filter_punct=True, filter_nums=False)

expected vs. actual behavior

Expected behavior: to_bag_of_terms outputs the word "didn't" as: "didn't"

Actual behavior: to_bag_of_terms outputs the word "didn't" as: "didn't", "did", "n't"

context

Trying to count n-grams in a block of text. If the block of text contains both the word "didn't" and the word "did", it double counts "did". The split also skews n-gram counting, because "n't do" gets counted as a bigram instead of "didn't do".
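The double counting described above can be reproduced with a toy counter (a minimal sketch, not textacy's actual implementation) over the tokens that spaCy produces for "didn't do":

```python
from collections import Counter

def bag_of_ngrams(tokens, sizes=(1, 2)):
    """Count all n-grams of the given sizes over a token list.

    A toy stand-in for combined unigram/bigram counting, just to show
    how the split contraction leaks into the counts.
    """
    bag = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

# spaCy tokenizes "didn't do" into ["did", "n't", "do"], so a bare
# "did" unigram and an "n't do" bigram both show up in the bag,
# even though the original text never contains a standalone "did".
tokens = ["did", "n't", "do"]
bag = bag_of_ngrams(tokens, sizes=(1, 2))
print(bag["did"])      # the split contraction contributes a "did" unigram
print(bag["n't do"])   # and an "n't do" bigram instead of "didn't do"
```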

environment

Thanks :-)

bdewilde commented 3 years ago

Hi @nealonhager , I think there are a couple things going on here.

First, the word "didn't" (and similar contractions) is typically tokenized into "did" and "n't", since it is, in a sense, two separate words mushed together. textacy relies on spaCy's tokenization:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("didn't")
>>> [tok for tok in doc]
[did, n't]

Second, ngrams=(1, 2, 3) tells the to_bag_of_terms() function to include the combination of unigrams, bigrams, and trigrams in the bag of terms (see here), which by definition results in overlapping representations of adjacent terms. So, from the original "didn't", the terms "did" and "n't" are the unigrams and "didn't" is the bigram.
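You can see both behaviors directly in spaCy (a small sketch; a blank English pipeline suffices here, since contraction splitting is a tokenizer rule rather than a model behavior):

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model needed.
nlp = spacy.blank("en")
doc = nlp("didn't")

unigrams = [tok.text for tok in doc]  # the two unigram tokens
bigram = doc[0:2].text                # the span over both tokens re-joins
                                      # them with the original (no) whitespace
print(unigrams, bigram)
```

Because there is no whitespace between "did" and "n't" in the source text, the two-token span renders back as "didn't", which is why the bigram looks identical to the original word.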

This is all standard and expected behavior. If it doesn't work for your use case, there is functionality for merging tokens: check out merge_spans() here. Using the simple example above:

>>> textacy.spacier.utils.merge_spans([doc[0:2]], doc)
>>> [tok for tok in doc]
[didn't]
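For completeness, spaCy itself also offers an in-place merge via its retokenizer, which is another way to get the same single-token result (a minimal sketch using a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("didn't do")

# Merge the two contraction tokens back into a single token in place.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([tok.text for tok in doc])  # the contraction is one token again
```

After the merge, downstream n-gram counting over the doc's tokens treats "didn't" as a single unit.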