KeremZaman / semantic-sh

semantic-sh is a SimHash implementation that detects and groups similar texts using word vectors and transformer-based language models (BERT).
MIT License

save/load custom models #3

Closed: ghost closed this issue 3 years ago

ghost commented 4 years ago

Hi,

Hope you are all well!

Is it possible to save/dump models and load them again afterwards? I would like to avoid re-indexing all my documents, since I have 230k of them.

Cheers, X

KeremZaman commented 4 years ago

Hi, thanks.

I can add simple save/load functionality for the document table using pickle, but you may want to consider a DBMS to handle that many documents.
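As a rough illustration of the pickle idea, here is a minimal sketch. It assumes the document table is a plain dict mapping a hash string to a list of documents; the names `save_table` and `load_table` are hypothetical and are not part of semantic-sh.

```python
import pickle


def save_table(table, path):
    """Write a hash -> documents mapping to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_table(path):
    """Read back a mapping previously written by save_table."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.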

ghost commented 4 years ago

I store the actual documents in a MySQL database. Is it possible to do this with MySQL?

KeremZaman commented 4 years ago

I think it's more convenient to use a solution like MongoDB, where you can store data in a JSON structure. If you use document hashes as keys and store the corresponding lists of documents as values, you can access documents directly via their hashes, much as semantic-sh does.
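To make the hash-as-key layout concrete, here is a small sketch of how the buckets could be shaped as JSON records. It uses only the standard library and does not talk to a database; `to_records` and `from_records` are hypothetical helpers, and the `documents` field name is an assumption. `_id` is the field MongoDB uses as its primary key.

```python
import json


def to_records(table):
    """Turn a hash -> documents mapping into one JSON-style record per hash."""
    return [{"_id": h, "documents": docs} for h, docs in table.items()]


def from_records(records):
    """Rebuild the hash -> documents mapping from such records."""
    return {rec["_id"]: rec["documents"] for rec in records}


# Example: two buckets keyed by (made-up) hex hash strings.
table = {"a1b2": ["first text", "a very similar text"], "ffff": ["unrelated text"]}
payload = json.dumps(to_records(table))  # what would be sent to the document store
restored = from_records(json.loads(payload))
```

With records shaped like this, looking up all documents that share a hash is a single primary-key query.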

In any case, you should dump the projection matrix so that the same hash function is used each time. I can add load/save functionality in a few days. You can use it to store the model and hash function, and use a MongoDB-like solution for storing the documents.
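The reason the projection matrix matters can be sketched as follows. This is a standard-library illustration of random-projection SimHash, not the semantic-sh implementation: the function names, the list-of-lists matrix, and the bit ordering are all assumptions made for the example.

```python
import pickle
import random


def make_projection(n_bits, dim, seed=None):
    """Draw n_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(vector, projection):
    """Hash a vector to an integer: one bit per hyperplane, set if the dot product is >= 0."""
    bits = 0
    for row in projection:
        dot = sum(w * x for w, x in zip(row, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def save_projection(projection, path):
    """Persist the matrix so later runs reuse the same hash function."""
    with open(path, "wb") as f:
        pickle.dump(projection, f)


def load_projection(path):
    """Load a matrix written by save_projection."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

If a fresh matrix were drawn on every run, the same text would generally land in a different bucket each time, so hashes stored earlier would no longer match new queries. Reloading the saved matrix keeps old and new hashes comparable.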

ghost commented 4 years ago

Do you have a Twitter account so I can DM you some questions without polluting the issue? My Twitter account is: https://twitter.com/x0rzkov