KeremZaman / semantic-sh

semantic-sh is a SimHash implementation for detecting and grouping similar texts, using word vectors and transformer-based language models (BERT).
MIT License
24 stars 3 forks

add/return custom attributes #4

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

Hi,

Hope you are all well!

It would be useful to be able to attach custom attributes, such as a doc_id, when indexing documents, and to get them back when retrieving similar documents.

Thanks in advance for any insights or input on that :-)

Cheers, X

KeremZaman commented 4 years ago

For retrieving similar documents, you can use the get_similar_groups() method. To identify each document you add, you can build your own data structure that fits your needs.

It's a bit hacky, but you can also make use of the private class member _buckets of SemanticSimHash, which is a dictionary mapping each bucket hash to its list of documents.
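As a minimal sketch of the "own data structure" idea: a side table that maps each stored text back to caller-supplied attributes. It assumes similarity queries hand back the stored texts verbatim. `DocRegistry` and its methods are hypothetical names for illustration and are not part of semantic-sh.

```python
from collections import defaultdict


class DocRegistry:
    """Hypothetical side table: stored text -> caller-supplied attributes."""

    def __init__(self):
        # The same text may be added under several ids, so keep a list per text.
        self._attrs_by_text = defaultdict(list)

    def register(self, text, **attrs):
        """Remember custom attributes (e.g. doc_id) for a text before indexing it."""
        self._attrs_by_text[text].append(attrs)

    def lookup(self, text):
        """Return every attribute dict registered for this exact text."""
        return list(self._attrs_by_text.get(text, []))

    def annotate(self, texts):
        """Pair each text returned by a similarity query with its attributes."""
        return [(t, self.lookup(t)) for t in texts]


registry = DocRegistry()
registry.register("the cat sat on the mat", doc_id=1)
registry.register("a cat sat on a mat", doc_id=2)

# Stand-in for one group of similar texts returned by the index.
group = ["the cat sat on the mat", "a cat sat on a mat"]
annotated = registry.annotate(group)
```

Here `annotated` pairs each text in the group with the attributes registered for it, so a doc_id can be recovered without touching the index internals.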

ghost commented 4 years ago

I was thinking of computing the text's simhash and saving it in the database, then adding the doc.

It is going to be quite slow, but it is the only solution available right now.

What would be great is to return the simhash in the JSON response when adding a document. It would save one step.

If the text hash were also returned for each matched text in the search results, that would solve this issue.
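To make the request concrete, here is a rough sketch of the behavior I have in mind. It uses a plain token-level SimHash built with `hashlib` as a stand-in, not the library's embedding-based hashing. `toy_simhash`, `ToyIndex`, and the field names in the returned dictionaries are all hypothetical.

```python
import hashlib
from collections import defaultdict


def toy_simhash(text, bits=64):
    """Plain token-level SimHash; a stand-in for an embedding-based hash."""
    weights = [0] * bits
    for token in text.lower().split():
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


class ToyIndex:
    """Hypothetical index: add() returns the hash, search() reports it per hit."""

    def __init__(self):
        self._buckets = defaultdict(list)  # hash -> list of (doc_id, text)

    def add(self, text, doc_id):
        h = toy_simhash(text)
        self._buckets[h].append((doc_id, text))
        return {"doc_id": doc_id, "hash": format(h, "016x")}

    def search(self, text, max_distance=3):
        q = toy_simhash(text)
        hits = []
        for h, docs in self._buckets.items():
            distance = bin(q ^ h).count("1")  # Hamming distance between hashes
            if distance <= max_distance:
                for doc_id, doc_text in docs:
                    hits.append({
                        "doc_id": doc_id,
                        "text": doc_text,
                        "hash": format(h, "016x"),
                        "distance": distance,
                    })
        return hits


index = ToyIndex()
added = index.add("the quick brown fox", doc_id="a1")
hits = index.search("the quick brown fox")
```

With the hash returned from `add()` and repeated in each search hit, the caller can store it next to the doc_id once and join on it later, instead of hashing every text a second time.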

You mentioned a batch insert feature in the README.md. Is it something you plan to add in the coming days? That would be awesome.

KeremZaman commented 4 years ago

I see. It would be better to return the hash of the bucket a document lands in when it is added. I may be able to add this feature, along with batch processing, next week.