Module 'textacy' has no attribute 'Vectorizer', following Quickstart docs

thepartisan101 commented 4 years ago

Duplicate of issue #192 (solved in 2018), but I cannot apply that solution on textacy 0.10.

Hi, I'm using textacy in a Google Colab environment, running a default hosted runtime, installed through !pip3 install textacy. I've run into the same problem as the OP in #192. I'm following the official documentation examples, and the installed version is the latest (textacy-0.10.0).

When I try dir(textacy), this is the output:

['Corpus', 'DEFAULT_DATA_DIR', 'TextStats', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'about', 'cache', 'constants', 'corpus', 'datasets', 'extract', 'io', 'ke', 'lang_utils', 'load_spacy_lang', 'logger', 'logging', 'make_spacy_doc', 'network', 'preprocessing', 'set_doc_extensions', 'similarity', 'spacier', 'text_stats', 'text_utils', 'utils', 'vsm']

textacy.vsm.Vectorizer results in: textacy.vsm.vectorizers.Vectorizer

but textacy.Vectorizer gives:

AttributeError                            Traceback (most recent call last)
in ()
----> 1 textacy.Vectorizer

AttributeError: module 'textacy' has no attribute 'Vectorizer'

Thanks for any help!

bdewilde commented 4 years ago

Hi @thepartisan101, the Vectorizer isn't imported as a top-level object in v0.10, so this —

>>> import textacy
>>> textacy.Vectorizer
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-60485ad0cd96> in <module>
----> 1 textacy.Vectorizer

AttributeError: module 'textacy' has no attribute 'Vectorizer'

— is expected behavior. To get that class, do

>>> import textacy.vsm
>>> textacy.vsm.Vectorizer

It sounds like you saw something in the documentation that suggested otherwise. Could you point me to it?
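For anyone wondering why the class is reachable through the submodule but not the package itself, here is a minimal sketch of the mechanism. It does not use textacy at all: `demo_pkg` and `demo_pkg.vsm` are hypothetical stand-in modules built in memory, assumed only to mirror a layout where a class lives in a submodule and is not re-exported by the package's top level.

```python
import sys
import types

# Build a throwaway package "demo_pkg" with a submodule "demo_pkg.vsm".
# Both names are hypothetical and exist only for this illustration.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # an (empty) __path__ makes the module count as a package
vsm = types.ModuleType("demo_pkg.vsm")


class Vectorizer:
    """Stand-in class; it only needs to exist."""


# The class is defined on the submodule only, never on the package itself.
vsm.Vectorizer = Vectorizer
pkg.vsm = vsm
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.vsm"] = vsm

import demo_pkg
import demo_pkg.vsm

# Top-level access fails because the package never re-exports the class.
try:
    demo_pkg.Vectorizer
    top_level_ok = True
    message = ""
except AttributeError as exc:
    top_level_ok = False
    message = str(exc)

# Access through the submodule succeeds.
submodule_ok = demo_pkg.vsm.Vectorizer is Vectorizer

print(top_level_ok, submodule_ok)
print(message)
```

A name only shows up as `package.Name` if the package's own `__init__` binds it; otherwise it has to be reached through the submodule that defines it.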

thepartisan101 commented 4 years ago

Thanks a lot @bdewilde! That worked. I was following this: https://textacy.readthedocs.io/en/stable/getting_started/quickstart.html#analyze-a-corpus

You can transform a corpus into a document-term matrix, with flexible tokenization, weighting, and filtering of terms:

>>> import textacy.vsm  # note the import
>>> vectorizer = textacy.Vectorizer(
...     tf_type="linear", apply_idf=True, idf_type="smooth", norm="l2",
...     min_df=2, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(
...     (doc._.to_terms_list(ngrams=1, entities=True, as_strings=True)
...      for doc in corpus))
>>> print(repr(doc_term_matrix))
<1240x12577 sparse matrix of type '<class 'numpy.float64'>'
    with 217067 stored elements in Compressed Sparse Row format>

All I had to do was vectorizer = textacy.vsm.Vectorizer()
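If a script has to run against releases that may differ in whether a name is exposed at the top level, one option is a small lookup helper. This is only a sketch: `resolve` is a hypothetical function, not part of textacy, and the demonstration below uses standard-library packages so it runs without textacy installed.

```python
import importlib


def resolve(package, attr, fallback_submodule):
    """Return package.attr if it exists, else package.fallback_submodule.attr.

    Hypothetical helper for illustration; raises AttributeError if the
    name is found in neither place.
    """
    pkg = importlib.import_module(package)
    if hasattr(pkg, attr):
        return getattr(pkg, attr)
    # Importing the submodule explicitly makes its contents available.
    sub = importlib.import_module(f"{package}.{fallback_submodule}")
    return getattr(sub, attr)


# Re-exported at the top level: found on the package directly.
decoder_cls = resolve("json", "JSONDecoder", "decoder")

# Not re-exported by the "xml" package: found via the submodule fallback.
document_cls = resolve("xml", "Document", "dom.minidom")

print(decoder_cls.__name__, document_cls.__name__)
```

With textacy 0.10 installed, the equivalent call would presumably be `resolve("textacy", "Vectorizer", "vsm")`, though writing `textacy.vsm.Vectorizer` directly, as above, is simpler when only one version is targeted.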

bdewilde commented 4 years ago

great, glad that worked! two things:

- you've been referring to old docs. i used to host them at readthedocs, but switched over to github pages: https://chartbeat-labs.github.io/textacy/build/html/getting_started/quickstart.html#analyze-a-corpus
- the newer docs also have this typo; will fix! 😅

thepartisan101 commented 4 years ago

Good to know, thanks!

bdewilde commented 4 years ago

You'll be happy to hear that somebody fixed this for me in #302 — sorry to drop the ball! 🤦 Closing this issue out now.

chartbeat-labs / textacy

Module 'textacy' has no attribute 'Vectorizer', following Quickstart docs #300