Open kuk opened 1 year ago
I think you brought up a great point. The hyper-parameter optimization code needs work (it takes the weights and assigns `b` and `r` for the index, see https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L22). The current implementation is data-agnostic and is not tuned toward any specific recall requirement.
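For readers who have not looked at that code, here is a rough sketch of how a weight-driven choice of `b` (bands) and `r` (rows per band) can work. It is not a copy of the linked implementation: the function names are hypothetical and the midpoint-rule integration is my own simplification. It relies only on the standard LSH banding result that a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^r)^b`.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def integrate(f, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def false_positive_area(threshold, b, r):
    # Area under the S-curve below the threshold: dissimilar pairs that collide.
    return integrate(lambda s: candidate_probability(s, b, r), 0.0, threshold)


def false_negative_area(threshold, b, r):
    # Area above the S-curve past the threshold: similar pairs that are missed.
    return integrate(lambda s: 1.0 - candidate_probability(s, b, r), threshold, 1.0)


def pick_params(threshold, num_perm, fp_weight, fn_weight):
    """Grid-search (b, r) with b * r <= num_perm, minimizing the weighted error."""
    best, best_error = None, float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            error = (fp_weight * false_positive_area(threshold, b, r)
                     + fn_weight * false_negative_area(threshold, b, r))
            if error < best_error:
                best, best_error = (b, r), error
    return best


if __name__ == "__main__":
    # Compare balanced weights with weights that favor recall.
    for weights in [(0.5, 0.5), (0.1, 0.9)]:
        b, r = pick_params(0.9, 128, *weights)
        miss = 1.0 - candidate_probability(0.9, b, r)
        print(weights, "->", (b, r), "miss probability at s=0.9:", miss)
```

Note that nothing in this search looks at the data or at a target recall; it only trades off the two areas according to the weights, which is the limitation described above.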
Ideas welcome!
I prepared 10 synthetic examples, each a (value, query) pair.
The value and query in each pair have Jaccard > 0.9.
I insert all values into MinHashLSH using the default settings. For every query I expect exactly one result, but in 4 / 10 cases I get no results.
When I change the weights, I get correct results.
Is this expected behavior? Maybe the default threshold or weights should be changed?
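To judge whether misses like this are plausible, it may help to compute the chance that a single pair is never bucketed together. The sketch below uses the standard banding formula, where a pair with Jaccard similarity `s` is missed with probability `(1 - s^r)^b`. The `(b, r)` values in it are hypothetical placeholders, not the values the library actually picks by default, and the function names are my own.

```python
def miss_probability(s, b, r):
    """Probability that a pair with Jaccard similarity s collides in none of the b bands."""
    return (1.0 - s ** r) ** b


def expected_misses(similarities, b, r):
    """Expected number of pairs that return no result, one query per pair."""
    return sum(miss_probability(s, b, r) for s in similarities)


if __name__ == "__main__":
    # Ten pairs sitting just above the threshold, as in the report above.
    similarities = [0.91] * 10
    # Hypothetical (b, r) settings with b * r <= 128; substitute the real ones.
    for b, r in [(4, 32), (8, 16), (16, 8)]:
        print((b, r), "expected misses out of 10:", expected_misses(similarities, b, r))
```

Pairs that sit right at the threshold are exactly where the S-curve is steepest, so a noticeable miss rate near `s = 0.9` would not be surprising for settings with large `r`.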