chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

In vectorizer.fit_transform() function, when tf_type="log" we get UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float64') to dtype('int32') with casting rule 'same_kind' #288

Open rohetoric opened 4 years ago

rohetoric commented 4 years ago

steps to reproduce

  1. Read a text file.

  2. Set the value of the following parameters one by one:

         tf_type  = ["linear", "sqrt", "log", "binary"]
         idf_type = ["standard", "smooth", "bm25"]
         dl_type  = ["linear", "sqrt", "log"]
         norm     = ["l1", "l2"]
         models   = ["lsa", "lda", "nmf"]

  3. Iterate with a nested loop over the values of all 5 parameters and compute doc_term_matrix, i.e.

         for t in tf_type:
             for i in idf_type:
                 for d in dl_type:
                     for n in norm:
                         for mo in models:
                             vectorizer = textacy.vsm.Vectorizer(
                                 tf_type=t, apply_idf=True, idf_type=i, dl_type=d,
                                 norm=n, min_df=2, max_df=0.95)
                             doc_term_matrix = vectorizer.fit_transform(
                                 (doc._.to_terms_list(ngrams=3, entities=True, as_strings=True)
                                  for doc in spacy_gram))

  4. When tf_type="log", the above error is raised.
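For reference, the nested loop in step 3 could also be flattened with itertools.product. This is only a sketch that builds the parameter combinations and does not call textacy; the names `grid`, `combos` and `log_combos` are made up for illustration.

```python
from itertools import product

# Parameter values copied from step 2; the dict keys are illustrative labels.
grid = {
    "tf_type": ["linear", "sqrt", "log", "binary"],
    "idf_type": ["standard", "smooth", "bm25"],
    "dl_type": ["linear", "sqrt", "log"],
    "norm": ["l1", "l2"],
    "model": ["lsa", "lda", "nmf"],
}

# One dict per combination, in the same order the nested loops would visit them.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# The combinations that hit the failing branch (tf_type="log").
log_combos = [c for c in combos if c["tf_type"] == "log"]

print(len(combos), len(log_combos))
```

Each dict in `combos` holds one setting per parameter, so the failing cases can be isolated by filtering on `tf_type` rather than by re-running the whole loop.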

expected vs. actual behavior

possible solution?

I saw that vectorizer.fit_transform calls a function _reweight_values(self, doc_term_matrix). When tf_type="log", it runs np.log(doc_term_matrix.data, doc_term_matrix.data, casting="unsafe"). Even though the casting is declared as "unsafe" there, the error is raised on the next line, i.e. doc_term_matrix.data += 1.0. I think it should instead be written as doc_term_matrix.data = doc_term_matrix.data + 1.0, according to https://stackoverflow.com/questions/38673531/multiply-numpy-int-and-float-arrays-cannot-cast-ufunc-multiply-output-from-dtyp
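The casting behaviour can be reproduced with plain NumPy, independent of textacy. The sketch below assumes the matrix data is an int32 array of term counts (as the error message suggests); `counts`, `shifted` and `log_tf` are illustrative names, not textacy internals.

```python
import numpy as np

# Stand-in for doc_term_matrix.data: integer term counts (assumed int32).
counts = np.array([1, 2, 3], dtype=np.int32)

# An in-place add of a float to an int array is rejected under the
# default 'same_kind' casting rule.
try:
    counts += 1.0
    in_place_failed = False
except TypeError as exc:  # UFuncTypeError is a subclass of TypeError
    in_place_failed = True
    print(type(exc).__name__, exc)

# An out-of-place add builds a new float64 array instead of writing
# the float result back into the int32 buffer.
shifted = counts + 1.0

# Converting to float before taking the log keeps the fractional part,
# which writing the log back into an integer array would discard.
log_tf = np.log(counts.astype(np.float64)) + 1.0
print(shifted.dtype, log_tf)
```

Note that the out-of-place add only avoids the exception. If the log has already been written into an integer array with casting="unsafe", its fractional part has been truncated, so converting the data to float first may be the safer change.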

context

I am trying to get clusters with similar intent from my dataset, and for that I need the document-term matrix. I am brute-forcing the vectorizer parameters in a loop to find the combination that gives the best silhouette score for the clusters.

environment

Receiving a TypeError here in print_markdown(items), i.e. TypeError: s must be (<class 'str'>, <class 'bytes'>), not <class 'list'>, inside the to_unicode(s, encoding, errors) function.