Cython wrapper to keep model in memory and infer embeddings

epfml / sent2vec

General purpose unsupervised sentence representations

Cython wrapper to keep model in memory and infer embeddings #17

Closed mpagli closed 6 years ago

mpagli commented 6 years ago

This PR adds the ability to keep a model in memory and get sentence embeddings directly from Python. Previously, to get embeddings we had to call the C++ executable, which would:

1. load the model
2. read sentences from disk
3. infer embeddings and write them to disk

After those steps, we had to load the inferred embeddings from disk to use them in a program. This process generates a lot of disk I/O and forces us to reload the model whenever we want new embeddings.

To solve this problem, I wrote a Cython wrapper exposing the functionality needed to load models and infer embeddings:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")

Tokenization is not handled by the wrapper, so anyone using it must preprocess the input themselves.
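As a rough illustration of that preprocessing step, here is a minimal sketch. `simple_tokenize` is a hypothetical helper, not part of the wrapper, and the lowercasing and punctuation splitting are assumptions; the tokenization should match whatever was used to train the model being loaded.

```python
import re

# Hypothetical helper, not part of the sent2vec wrapper.
# Matches either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(sentence):
    """Lowercase the sentence and separate punctuation from words with spaces."""
    tokens = _TOKEN_RE.findall(sentence.lower())
    return " ".join(tokens)


# The result could then be passed to the wrapper, e.g.:
#   emb = model.embed_sentence(simple_tokenize("Once upon a time."))
```

With this sketch, "Once upon a time." becomes "once upon a time .", which has the same shape as the example sentence above.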

To compile the module, simply run python setup.py build_ext --inplace from the src folder.

martinjaggi commented 6 years ago

looks good to me!

BTW, doesn't anyone offer similar functionality for fasttext, since it has the same problem?


mpagli commented 6 years ago

Yes, there is a package, also implemented using Cython, that wraps the official fasttext functionality: https://pypi.python.org/pypi/fasttext

Domi-Zhang commented 6 years ago

Nice work! However, I found some problems:

1. 'embed_sentence' cannot handle UTF-8 encoded sentences, and my corpus contains many Chinese characters.
2. It would be better if the output vector were normalized before being returned.
3. I got a new package named "UNKNOWN-0.0.0" after running "python setup.py install .", which is not good for package management.

mpagli commented 6 years ago

@Domi-Zhang: 1 and 3 should be fixed by https://github.com/epfml/sent2vec/commit/0e54ad8bfa43fa5da30463b3586607508ac00606. Concerning 2, I believe the default behavior should be to return unnormalized vectors. What I can do is modify the wrapper to let the user choose between normalized and unnormalized vectors.
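A minimal sketch of what such an option could look like on the Python side, assuming the embedding is a plain sequence of floats. `l2_normalize`, `postprocess`, and the `normalize` flag are hypothetical names, not the wrapper's API.

```python
import math


def l2_normalize(vector):
    """Return a copy of the vector scaled to unit L2 norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def postprocess(embedding, normalize=False):
    """Hypothetical switch: unnormalized by default, L2-normalized on request."""
    return l2_normalize(embedding) if normalize else list(embedding)
```

Until such an option exists, the same normalization can be applied by the caller to the vector returned by embed_sentence.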