chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Setting Vectorizer.min_df results in empty vocabulary where it shouldn't. #339

Closed scayze closed 3 years ago

scayze commented 3 years ago

Hello! Thanks for maintaining this amazing library. I ran into an issue where setting min_df (in my example, to 2) in the Vectorizer raises the error: ValueError: After filtering, no terms remain; try a lower `min_df` or higher `max_df`. As can be seen in the example below, several terms appear in two or more of the documents (a quick document-frequency check is included after the snippet), so the error should not occur.

steps to reproduce

Minimal example to reproduce the issue:

import textacy
from textacy import extract
from textacy.representations.vectorizers import Vectorizer

def extract_terms(doc):
    return extract.terms(  
        doc,
        ngs=1,
        ents=True
    )

en = textacy.load_spacy_lang("en_core_web_sm")

textlist = [
    "peter loves icecream and ducks",
    "ducks like icecream",
    "icecream loves peter",
    "be like ducks"
]
docs = [textacy.make_spacy_doc(t, lang=en) for t in textlist]

vectorizer = Vectorizer(
    tf_type='linear',
    idf_type='standard',
    norm="l2",
    min_df=2,
)

extracted_terms = (extract_terms(doc) for doc in docs)
vectorizer = vectorizer.fit(extracted_terms)  # ValueError: After filtering, no terms remain; try a lower `min_df` or higher `max_df`
result = vectorizer.transform(extracted_terms)
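
To back up that claim, here is a quick document-frequency check (an illustrative sketch, not part of the failing call itself); it reuses the docs and extract_terms defined above and counts how many documents each lemmatized term occurs in:

from collections import Counter

# Count, per lemma, the number of documents it appears in; with min_df=2,
# any term with a count of 2 or more should survive the filtering step.
doc_freq = Counter(
    lemma
    for doc in docs
    for lemma in {term.lemma_ for term in extract_terms(doc)}
)
print(doc_freq)
# e.g. Counter({'duck': 3, 'icecream': 3, 'peter': 2, 'love': 2, 'like': 2, 'be': 1})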

expected vs. actual behavior

possible solution?

context

environment

Python: 3.9.2
OS: Windows 10
Package list:

aplus==0.11.0
attrs==21.2.0
backcall==0.2.0
blis==0.7.4
cachetools==4.2.2
catalogue==2.0.4
certifi==2020.12.5
cffi==1.14.5
chardet==4.0.0
click==7.1.2
cloudpickle==1.6.0
cmake==3.18.4.post1
colorama==0.4.4
cycler==0.10.0
cymem==2.0.5
cytoolz==0.11.0
dask==2021.5.0
decorator==4.4.2
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
Flask==2.0.0
frozendict==2.0.2
fsspec==2021.5.0
future==0.18.2
h5py==3.2.1
idna==2.10
ipykernel==5.5.5
ipython==7.24.0
ipython-genutils==0.2.0
itsdangerous==2.0.0
jedi==0.18.0
jellyfish==0.8.2
Jinja2==3.0.0
joblib==1.0.1
jupyter-client==6.1.12
jupyter-core==4.7.1
kiwisolver==1.3.1
locket==0.2.1
MarkupSafe==2.0.0
matplotlib==3.4.2
matplotlib-inline==0.1.2
MulticoreTSNE==0.1
murmurhash==1.0.5
nest-asyncio==1.5.1
networkx==2.5.1
nltk==3.6.2
numpy==1.20.3
packaging==20.9
pandas==1.2.4
parso==0.8.2
partd==1.2.0
pathy==0.5.2
pickleshare==0.7.5
Pillow==8.2.0
preshed==3.0.5
progressbar2==3.53.1
prompt-toolkit==3.0.18
psutil==5.8.0
pyarrow==4.0.0
pycparser==2.20
pydantic==1.7.4
Pygments==2.9.0
pymongo==3.11.4
pyparsing==2.4.7
Pyphen==0.10.0
python-dateutil==2.8.1
python-utils==2.5.6
pytz==2021.1
pywin32==300
PyYAML==5.4.1
pyzmq==22.1.0
regex==2021.4.4
requests==2.25.1
scikit-learn==0.24.2
scipy==1.6.3
six==1.16.0
sklearn==0.0
smart-open==3.0.0
spacy==3.0.6
spacy-legacy==3.0.5
srsly==2.4.1
tabulate==0.8.9
textacy==0.11.0
thinc==8.0.3
threadpoolctl==2.1.0
toolz==0.11.1
tornado==6.1
tqdm==4.60.0
traitlets==5.0.5
typer==0.3.2
urllib3==1.26.4
vaex-core==4.1.0
vaex-hdf5==0.7.0
wasabi==0.8.2
wcwidth==0.2.5
Werkzeug==2.0.0
bdewilde commented 3 years ago

Hi @scayze , apologies for leaving you hanging for so long — I've been busy working on other projects for the past few months.

The cause of your issue is that Vectorizer.fit() requires a nested sequence of strings as inputs, but you're giving it a nested sequence of spacy.Span objects:

>>> [list(extract_terms(doc)) for doc in docs]
[[peter, loves, icecream, ducks],
 [ducks, like, icecream],
 [icecream, loves, peter],
 [like, ducks]]
>>> [[term.lemma_ for term in extract_terms(doc)] for doc in docs]
[['peter', 'love', 'icecream', 'duck'],
 ['duck', 'like', 'icecream'],
 ['icecream', 'love', 'peter'],
 ['like', 'duck']]

This is documented (here, for example) and the type annotations are correct, but it seems like an easy error to make. I'll see if I can add some checks and/or more useful error messaging around this.
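
For completeness, a minimal fix for the snippet above (a sketch that assumes lemmas are the desired string form of each term; names mirror the original example) is to convert the extracted Span objects to strings before fitting:

# Build lists of term strings (here: lemmas) up front; using lists rather than
# a generator also avoids passing an already-exhausted iterable to transform().
tokenized_docs = [
    [term.lemma_ for term in extract_terms(doc)]
    for doc in docs
]

vectorizer = Vectorizer(
    tf_type="linear",
    idf_type="standard",
    norm="l2",
    min_df=2,
)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print(doc_term_matrix.shape)  # (number of docs, number of terms kept by min_df)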