Closed dselivanov closed 7 years ago
Would it make get_similar_pairs more RAM friendly? Or is it related to get_signature_matrix? Or both?
Right now, I have some memory issues with get_similar_pairs, because I want to recover most docs (tf-idf) with cos distance around 0.4. According to the S curve, I need around 400 projections and 80 bands to recover over 50% of the documents.
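For reference, a quick way to check numbers like these. This is only a sketch: it assumes the textbook banding formula 1 - (1 - s^r)^b, with s taken directly as the per-hash collision probability, and the thread does not say which S curve was actually used. The function name `candidate_probability` is made up for illustration. With 400 projections split into 80 bands, each band has r = 5 rows.

```python
def candidate_probability(s, n_hashes, n_bands):
    """Probability that a pair becomes a candidate in at least one band.

    Assumes the standard banding S curve 1 - (1 - s**r)**b, where s is the
    chance that a single hash agrees for the pair (an assumption, not
    necessarily the curve used in the thread).
    """
    rows_per_band = n_hashes // n_bands
    return 1.0 - (1.0 - s ** rows_per_band) ** n_bands


# 400 projections, 80 bands -> 5 rows per band
p_400_80 = candidate_probability(0.4, n_hashes=400, n_bands=80)
print(round(p_400_80, 3))  # roughly 0.56 under this assumption

# Same rows per band, half the bands: recall drops noticeably
p_200_40 = candidate_probability(0.4, n_hashes=200, n_bands=40)
print(round(p_200_40, 3))
```

Under that assumption the result is a bit above 0.5, which is at least consistent with the "over 50%" figure quoted above.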
I am thinking of reducing the dictionary size to make the tf-idf matrix smaller (right now 50K). It should make docs more similar (not sure about the intuition)... Any other strategy to deal with that?
Edit: bad intuition, decreasing the dictionary size leaves fewer missing words in common -> makes things worse.
@pommedeterresautee this can affect only get_signature_matrix.
The main problem is your threshold: cos = 0.4 is very small. Documents at that level won't be noticeably similar. I'm wondering what the application is? I can't imagine a case where this would be useful.
At the moment we store:
We should not store these matrices at all: do the hashing on the fly and keep only the SEED.
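A rough illustration of the "keep only the seed" idea, not the package's actual code. It assumes the stored matrices are random projection (hyperplane) matrices, which the thread does not confirm, and every name here (`projection_weight`, `signature`) is hypothetical. Each hyperplane coordinate is regenerated from the seed when needed, so nothing of size projections x dictionary is ever held in memory.

```python
import random


def projection_weight(seed, bit, feature):
    """Regenerate one hyperplane coordinate from (seed, bit, feature).

    Seeding with a string is deterministic across runs, so the same
    coordinate comes back every time without being stored.
    """
    return random.Random(f"{seed}:{bit}:{feature}").gauss(0.0, 1.0)


def signature(doc, seed, n_bits):
    """Sign-of-projection signature for a sparse vector.

    doc is a dict {feature_index: weight}; only its non-zero features
    are touched, and no projection matrix is materialised.
    """
    bits = []
    for bit in range(n_bits):
        dot = sum(w * projection_weight(seed, bit, f) for f, w in doc.items())
        bits.append(1 if dot >= 0 else 0)
    return bits


doc_a = {3: 0.5, 17: 1.2, 42: 0.3}
doc_b = {3: 0.5, 17: 1.2, 42: 0.3}

sig_a = signature(doc_a, seed=2016, n_bits=16)
sig_b = signature(doc_b, seed=2016, n_bits=16)
print(sig_a == sig_b)  # identical documents hash identically
```

This trades memory for CPU: every coordinate is recomputed on each use, so a real implementation would want a much cheaper per-coordinate hash than a fresh generator per call.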