ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.58k stars 294 forks

MinHashLSH stored in Redis: does the query operation take too long? #168

Open MrRace opened 2 years ago

MrRace commented 2 years ago

Hi, I build the MinHashLSH index like this:

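from datasketch import MinHash, MinHashLSH

# Redis-backed LSH index; host_ip, host_port, host_password and db_num are defined elsewhere.
self.lsh = MinHashLSH(
    threshold=0.7,
    num_perm=128,
    storage_config={
        'type': 'redis',
        'basename': b'test_',
        'redis': {'host': host_ip, 'port': host_port, 'password': host_password, 'db': db_num},
    },
)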

Then I run a query like this:

import time

new_task_text = "mytext"
new_text_hash = MinHash(num_perm=128)
# Iterating over a str hashes each character as its own token.
new_text_hash.update_batch([s.encode('utf-8') for s in new_task_text])
newminhash_end_time = time.time()
query_start_time = time.time()
similar_text_ids = self.lsh.query(new_text_hash)
query_end_time = time.time()
print("query_cost_time=", query_end_time - query_start_time)  # 28ms

The query operation takes about 20 ms. Does that seem too long? Is there any way to speed it up? Thanks a lot!

ekzhu commented 2 years ago

It is using Redis as an external storage layer, so there is certainly some overhead, and how much depends on where your Redis instance is running. How about using the simple Python in-memory storage (i.e., without specifying any storage config)?