chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

to_bag_of_terms unexpected behavior #311

Closed: nealonhager closed this 3 years ago

nealonhager commented 3 years ago

steps to reproduce

  1. initialize doc with text containing the word "didn't"
  2. doc._.to_bag_of_terms(ngrams=(1, 2, 3), weighting="count", normalize=None, as_strings=True, filter_stops=False, filter_punct=True, filter_nums=False)

expected vs. actual behavior

Expected behavior: to_bag_of_terms outputs the word "didn't" as: "didn't"

Actual behavior: to_bag_of_terms outputs the word "didn't" as: "didn't", "did", "n't"

context

Trying to count n-grams in a block of text. If the block of text contains both the word "didn't" and the word "did", it double counts "did". The split also skews n-gram counting, because "n't do" gets counted as a bigram instead of "didn't do".
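The double counting described above can be reproduced with a toy counter (a minimal sketch, not textacy's actual implementation) over the tokens that spaCy produces for "didn't do":

```python
from collections import Counter

def bag_of_ngrams(tokens, sizes=(1, 2)):
    """Count all n-grams of the given sizes over a token list.

    A toy stand-in for combined unigram/bigram counting, just to show
    how the split contraction leaks into the counts.
    """
    bag = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

# spaCy tokenizes "didn't do" into ["did", "n't", "do"], so a bare
# "did" unigram and an "n't do" bigram both show up in the bag,
# even though the original text never contains a standalone "did".
tokens = ["did", "n't", "do"]
bag = bag_of_ngrams(tokens, sizes=(1, 2))
print(bag["did"])      # the split contraction contributes a "did" unigram
print(bag["n't do"])   # and an "n't do" bigram instead of "didn't do"
```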

environment

Thanks :-)

bdewilde commented 3 years ago

Hi @nealonhager , I think there are a couple things going on here.

First, the word "didn't" (and similar contractions) is typically tokenized into "did" and "n't", since it is, in a sense, two separate words mushed together. textacy relies on spaCy's tokenization:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("didn't")
>>> [tok for tok in doc]
[did, n't]

Second, ngrams=(1, 2, 3) tells the to_bag_of_terms() function to include the combination of unigrams, bigrams, and trigrams in the bag of terms (see here), which by definition results in overlapping representations of adjacent terms. So, from the original "didn't", the terms "did" and "n't" are the unigrams and "didn't" is the bigram.
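You can see both behaviors directly in spaCy (a small sketch; a blank English pipeline suffices here, since contraction splitting is a tokenizer rule rather than a model behavior):

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model needed.
nlp = spacy.blank("en")
doc = nlp("didn't")

unigrams = [tok.text for tok in doc]  # the two unigram tokens
bigram = doc[0:2].text                # the span over both tokens re-joins
                                      # them with the original (no) whitespace
print(unigrams, bigram)
```

Because there is no whitespace between "did" and "n't" in the source text, the two-token span renders back as "didn't", which is why the bigram looks identical to the original word.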

This is all standard and expected behavior. If it doesn't work for your use case, there is functionality for merging tokens: check out merge_spans() here. Using the simple example above:

>>> textacy.spacier.utils.merge_spans([doc[0:2]], doc)
>>> [tok for tok in doc]
[didn't]
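For completeness, spaCy itself also offers an in-place merge via its retokenizer, which is another way to get the same single-token result (a minimal sketch using a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("didn't do")

# Merge the two contraction tokens back into a single token in place.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([tok.text for tok in doc])  # the contraction is one token again
```

After the merge, downstream n-gram counting over the doc's tokens treats "didn't" as a single unit.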