chrisjmccormick / simsearch

Python tools for performing similarity searches on text documents.
MIT License

Error in computing dot product after changing input text documents #2

Open CaptainDroid opened 6 years ago

CaptainDroid commented 6 years ago

The provided sample data works fine and gives the desired results. However, when I train the model on different plain text documents (data generated by Faker) and run runSearchByText.py with custom input text in input.txt, it fails while computing the dot product between word_weights and vec2_lsi.

Here's the error that I am getting:

word_sims[word_id] = vec1_tfidf[word_id] * np.dot(word_weights, vec2_lsi) / norms;
ValueError: shapes (10,) and (300,) not aligned: 10 (dim 0) != 300 (dim 0)

I understand that the error occurs because the two vectors have different dimensions (10 and 300), but this happens only with my own data; the sample data works fine. What am I doing wrong?
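The mismatch described above can be reproduced outside of simsearch. The following is a minimal sketch, not code from the repository: `safe_dot` is a hypothetical helper, and the vectors are filled with placeholder values that only mirror the reported shapes. One possible cause, which is an assumption and not confirmed in this thread, is that the model trained on the new data ended up with fewer topics (10) than the 300 that were requested.

```python
import numpy as np


def safe_dot(word_weights, vec2_lsi):
    """Dot product that reports a length mismatch clearly.

    Hypothetical helper for illustration; not part of simsearch.
    """
    word_weights = np.asarray(word_weights, dtype=float)
    vec2_lsi = np.asarray(vec2_lsi, dtype=float)
    if word_weights.shape != vec2_lsi.shape:
        raise ValueError(
            "topic-count mismatch: word_weights has shape %s, vec2_lsi has shape %s"
            % (word_weights.shape, vec2_lsi.shape)
        )
    return float(np.dot(word_weights, vec2_lsi))


# Placeholder vectors with the shapes from the error message.
word_weights = np.ones(10)
vec2_lsi = np.ones(300)

try:
    np.dot(word_weights, vec2_lsi)
except ValueError as err:
    # NumPy refuses to multiply 1-D vectors of different lengths.
    print("np.dot failed:", err)
```

Printing the shapes of both vectors just before the failing line (or calling a check like `safe_dot` there) would show which side has the unexpected length.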

Mustyy commented 4 years ago

Did you find a way around this issue? When I run python parseMHC.py, I get the output below. Any ideas why?

SEC Financial Documents... Done.
Building corpus...
Vocabulary contains 0 unique words.
Corpus contains 0 "documents" represented by tf-idf vectors.
Training LSI...
Traceback (most recent call last):
  File "parseMHC.py", line 57, in <module>
    ssearch.trainLSI(num_topics=300)
  File "C:\Users\ibrahimm\Desktop\simsearch\simsearch.py", line 50, in trainLSI
    self.index = similarities.MatrixSimilarity(self.lsi[self.ksearch.corpus_tfidf], num_features=num_topics)
  File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\lsimodel.py", line 568, in __getitem__
    assert self.projection.u is not None, "decomposition not initialized yet"
AssertionError: decomposition not initialized yet
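The output above reports 0 unique words and 0 documents before training starts, so the assertion is probably a downstream symptom of an empty corpus and not a problem in the LSI step itself. A minimal sketch of an early check follows; `check_corpus` is a hypothetical helper, not part of simsearch or gensim, and it assumes each document is already a list of tokens.

```python
def check_corpus(documents):
    """Return (vocabulary size, document count), or raise if nothing can be trained.

    Hypothetical helper for illustration; assumes each document is a list of tokens.
    """
    non_empty = [doc for doc in documents if doc]
    vocabulary = {token for doc in non_empty for token in doc}
    if not non_empty or not vocabulary:
        raise ValueError(
            "corpus is empty: check that the input files were found and parsed"
        )
    return len(vocabulary), len(non_empty)


# Example with two tiny tokenized documents.
vocab_size, num_docs = check_corpus([["revenue", "growth"], ["revenue"]])
print(vocab_size, num_docs)
```

Running a check like this before the training call would surface a wrong input path or a parser that matched no files at the point where it happens.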