For retrieving similar documents, you can use the get_similar_groups() method. To identify each document you add, you can build your own data structure to fit your needs. It's a bit hacky, but you can make use of the private class member _buckets of SemanticSimHash, which is a dictionary of bucket hash and document list pairs.
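To make the "build your own data structure" idea concrete, here is a rough sketch of a side index that maps your own doc_id values to bucket hashes. It does not call the library at all: DocIndex, toy_bucket_hash, add and similar_ids are hypothetical names made up for illustration, and the hash function is a stand-in, not a real simhash. In practice you would plug in whatever hash the library computes for a text.

```python
from collections import defaultdict
import hashlib


def toy_bucket_hash(text):
    """Stand-in hash (assumption): NOT a real simhash, only exact-match bucketing."""
    digest = hashlib.md5(text.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


class DocIndex:
    """Hypothetical side index: keeps doc_ids next to their bucket hashes."""

    def __init__(self, hash_fn=toy_bucket_hash):
        self._hash_fn = hash_fn
        # bucket hash -> list of doc_ids, mirroring the shape described for _buckets
        self._buckets = defaultdict(list)
        # doc_id -> bucket hash, so a document can be located later
        self._bucket_of = {}

    def add(self, doc_id, text):
        """Register a document under its bucket and return the bucket hash."""
        bucket = self._hash_fn(text)
        self._buckets[bucket].append(doc_id)
        self._bucket_of[doc_id] = bucket
        return bucket

    def bucket_of(self, doc_id):
        """Return the bucket hash recorded for a doc_id."""
        return self._bucket_of[doc_id]

    def similar_ids(self, text):
        """Return the doc_ids that share a bucket with the given text."""
        return list(self._buckets.get(self._hash_fn(text), []))
```

With something like this, the ids returned by similar_ids() can be used to look up your own records, without touching the library's private members.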
I was thinking of getting the text's simhash and saving it in the database, then adding the doc.
It is going to be quite slow, but it is the only solution available now.
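Roughly, the workaround would look like the sketch below. It assumes a SQLite table and uses made-up names (save_then_add, ids_for_hash, placeholder_hash, the docs table); add_document is whatever callable actually sends the text to the index, and placeholder_hash only stands in for the real simhash call.

```python
import hashlib
import sqlite3


def placeholder_hash(text):
    """Stand-in for the real simhash call (assumption, not the library's hash)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def save_then_add(conn, doc_id, text, add_document, hash_fn=placeholder_hash):
    """Store doc_id and hash first, then hand the text to the index."""
    bucket_hash = hash_fn(text)
    conn.execute(
        "INSERT OR REPLACE INTO docs (doc_id, bucket_hash, text) VALUES (?, ?, ?)",
        (doc_id, bucket_hash, text),
    )
    conn.commit()
    add_document(text)  # hypothetical callable that indexes the document
    return bucket_hash


def ids_for_hash(conn, bucket_hash):
    """Map a hash coming back from a search to the stored doc_ids."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE bucket_hash = ? ORDER BY doc_id",
        (bucket_hash,),
    )
    return [row[0] for row in rows]


# In-memory database for the example; a real setup would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id TEXT PRIMARY KEY, bucket_hash TEXT, text TEXT)"
)
```

The extra hash computation before every insert is the slow part, which is why getting the hash back from the add call would help.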
What would be great is to return the simhash in the json when adding a document. It would save one process.
If we could return the text hash for each matched text in the search results it would solve this issue.
You mentioned a batch insert feature in the README.md. Is it something you plan to add in the coming days? That would be awesome.
I see. It would be better to return the hash of the document's bucket when a new document is added. Maybe next week I can add this feature and the batch processing stuff.
Hi,
Hope you are all well!
It would be useful to add custom attributes like the doc_id when indexing or retrieving similar documents.
Thanks in advance :-) for any insights or inputs on that.
Cheers, X