Redesign hashing to be more RAM-friendly

dselivanov commented 9 years ago

At the moment we store:

entire "permutation" hash matrix for minhashing
entire random projections for sketching

We should not store these matrices at all: do the hashing on the fly and keep only the seed.
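A minimal sketch of the "hash on the fly, keep only the seed" idea for minhashing, in Python rather than the package's own language. Everything here is an assumption for illustration: the function names `seeded_hash` and `minhash_signature` are hypothetical, and keyed BLAKE2b stands in for whatever hash family the package would actually use. Each of the `n_hashes` hash functions is derived from `(seed, i)` when needed, so no hash matrix is ever stored.

```python
import hashlib
import struct


def seeded_hash(seed: int, i: int, token: str) -> int:
    """Value of the i-th hash function (derived from `seed`) for `token`.

    Hypothetical helper: the pair (seed, i) is packed into a 16-byte key,
    so each i yields a different keyed hash without any stored table.
    """
    key = struct.pack("<QQ", seed, i)
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, n_hashes: int, seed: int) -> list:
    """MinHash signature of a non-empty token collection.

    Memory use is O(n_hashes) for the result; the hash functions
    themselves are recomputed from `seed` on every call.
    """
    unique = set(tokens)
    if not unique:
        raise ValueError("tokens must not be empty")
    return [min(seeded_hash(seed, i, t) for t in unique) for i in range(n_hashes)]


doc_a = ["redesign", "hashing", "ram", "friendly"]
doc_b = ["friendly", "ram", "hashing", "redesign"]

sig_a = minhash_signature(doc_a, n_hashes=16, seed=42)
sig_b = minhash_signature(doc_b, n_hashes=16, seed=42)
print(sig_a == sig_b)  # same token set and same seed give the same signature
```

The trade-off is CPU for RAM: every signature recomputes its hashes, but the only state that has to be kept between calls is the seed.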

pommedeterresautee commented 8 years ago

Would it make get_similar_pairs more RAM friendly? Or is it related to get_signature_matrix? Or both?

Right now I have some memory issues with get_similar_pairs, because I want to recover most docs (tf-idf) with a cosine distance of around 0.4. According to the S curve, I need around 400 projections and 80 bands to recover over 50% of the documents. I am thinking of reducing the dictionary size to make the tf-idf matrix smaller (right now 50K). It should make docs more similar (not sure about the intuition)... Any other strategy to deal with that?

Edit: bad intuition. Decreasing the dictionary size leaves fewer words in common, which makes things worse.
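The S curve figures above can be checked with a short sketch. It assumes the textbook banding formula, where a pair with per-row agreement probability s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows each. Treating the 0.4 directly as s is an assumption, as is the function name `candidate_probability`; the thread does not say how the package maps cosine values onto the curve.

```python
def candidate_probability(s: float, n_projections: int, n_bands: int) -> float:
    """Probability that a pair with per-row agreement `s` shares at least one band.

    Assumes the standard banding scheme: n_projections rows split evenly
    into n_bands bands, and a pair is a candidate if any band matches fully.
    """
    if n_projections % n_bands != 0:
        raise ValueError("n_projections must be divisible by n_bands")
    rows_per_band = n_projections // n_bands
    return 1.0 - (1.0 - s ** rows_per_band) ** n_bands


# 400 projections in 80 bands means 5 rows per band.
p = candidate_probability(0.4, n_projections=400, n_bands=80)
print(round(p, 2))  # 0.56
```

Under these assumptions the result is consistent with the "over 50%" figure, and it shows why the memory cost is high: a low threshold forces many short bands.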

dselivanov commented 8 years ago

@pommedeterresautee this would affect only get_signature_matrix.

The main problem is your threshold. cos = 0.4 is very small... The documents won't actually be noticeably similar. What is the application? I can't imagine one where this would be useful.

dselivanov / LSHR

Redesign hashing to be more RAM-friendly #2