Closed scayze closed 3 years ago
Hi @scayze, apologies for leaving you hanging for so long. I've been busy working on other projects for the past few months.
The cause of your issue is that `Vectorizer.fit()` requires a nested sequence of strings as inputs, but you're giving it a nested sequence of `spacy.Span` objects:
>>> [list(extract_terms(doc)) for doc in docs]
[[peter, loves, icecream, ducks],
[ducks, like, icecream],
[icecream, loves, peter],
[like, ducks]]
>>> [[term.lemma_ for term in extract_terms(doc)] for doc in docs]
[['peter', 'love', 'icecream', 'duck'],
['duck', 'like', 'icecream'],
['icecream', 'love', 'peter'],
['like', 'duck']]
This is documented (here, for example) and the type annotations are correct, but it seems like an easy error to make. I'll see if I can add some checks and/or more useful error messaging around this.
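To illustrate why the `min_df` filter ends up dropping everything, here is a minimal pure-Python sketch. It does not use textacy or spaCy: `FakeSpan` is a stand-in class that compares by identity, on the assumption that span objects taken from different documents never compare equal even when their text matches. `doc_freqs` and `ensure_str_terms` are hypothetical helpers written for this example, not part of the library, and the check is only a rough idea of what an early input validation could look like.

```python
from collections import Counter


class FakeSpan:
    """Stand-in for a span object (assumption): equality and hashing are by identity."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text


def doc_freqs(tokenized_docs):
    """Count, for each distinct term, how many documents contain it."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(set(doc))
    return counts


def ensure_str_terms(tokenized_docs):
    """Hypothetical input check: raise early if any term is not a str."""
    for i, doc in enumerate(tokenized_docs):
        for term in doc:
            if not isinstance(term, str):
                raise TypeError(
                    f"doc {i}: expected str terms, got {type(term).__name__}"
                )
    return tokenized_docs


texts = [
    ["peter", "love", "icecream", "duck"],
    ["duck", "like", "icecream"],
    ["icecream", "love", "peter"],
    ["like", "duck"],
]
span_docs = [[FakeSpan(t) for t in doc] for doc in texts]

# With strings, equal terms are merged across documents, so several reach df >= 2.
kept_str = sorted(t for t, n in doc_freqs(texts).items() if n >= 2)

# With identity-hashed objects, every term is unique to its document (df == 1),
# so a min_df of 2 leaves nothing behind.
kept_span = [t for t, n in doc_freqs(span_docs).items() if n >= 2]

print(kept_str)
print(kept_span)
```

With string terms all five lemmas pass a document-frequency threshold of 2, while the object-based version keeps none, which matches the "no terms remain" error. Calling `ensure_str_terms(span_docs)` raises a `TypeError` up front instead.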
Hello! Thanks for maintaining this amazing library. I ran into an issue where setting min_df (in my example to 2) in the Vectorizer raises the error:
ValueError: After filtering, no terms remain; try a lower `min_df` or higher `max_df`
As can be seen in the example below, there are many words that appear more than twice in the documents, and thus the error should not appear.
steps to reproduce
Minimal example to reproduce the issue:
expected vs. actual behavior
possible solution?
context
environment
Python: 3.9.2 Windows 10 Package list: