dselivanov / LSHR

Locality Sensitive Hashing In R

find similar within DB for a query #17

Closed reisner closed 5 years ago

reisner commented 5 years ago

From what I understand, the get_similar_pairs_cosine function finds pairs within a dataset. However, what if we want to find the similarity between a query and the dataset? It would be nice to have this functionality here. Do you have plans to update this package, or know of others that do this?

dselivanov commented 5 years ago

No plans for that. How big is the query? Maybe it makes sense to compute exact similarity between the query and the db? That is a matter of a single matrix multiplication.
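The "single matrix multiplication" approach can be sketched as follows. This is a minimal illustration in Python/NumPy (the thread is about an R package, so treat the names here as hypothetical): L2-normalize the rows of the database and the query, after which one matrix-vector product yields all cosine similarities at once.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64))             # toy "database": 1000 vectors, dim 64
query = db[42] + 0.01 * rng.standard_normal(64)  # a near-duplicate of row 42

# L2-normalize rows so a plain dot product equals cosine similarity
db_norm = db / np.linalg.norm(db, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)

scores = db_norm @ q_norm       # one matrix-vector multiplication: all similarities
top5 = np.argsort(-scores)[:5]  # indices of the 5 most similar database rows
```

Since the query is a perturbed copy of row 42, that row comes back as the top hit. The cost is one pass over the database per query, which is why brute force stays competitive until the database gets very large.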


reisner commented 5 years ago

Yeah, it's just a single query, but a very large DB. I would like it to be quick, in real time. The problem is that the database is going to be tens of millions of rows, and it starts to be slow at that point.

dselivanov commented 5 years ago

My suggestion is to benchmark. As I remember, I had a similar task: querying against 20M book titles. I started to build LSH-based retrieval, but in the end switched to brute force since it was faster.


reisner commented 5 years ago

Oh, I did benchmark, that's why I'm here :) No problem if you're not planning on implementing this, thanks!

dselivanov commented 5 years ago

Is your db a sparse or dense matrix?

reisner commented 5 years ago

Dense matrix; it's a word embedding database.

dselivanov commented 5 years ago

That's easy then - take a look at RcppAnnoy or https://github.com/jlmelville/rcpphnsw

reisner commented 5 years ago

Thanks for that!