epfml / sent2vec

General purpose unsupervised sentence representations

Cython wrapper to keep model in memory and infer embeddings #17

Closed mpagli closed 6 years ago

mpagli commented 6 years ago

This PR adds the ability to keep a model in memory and get sentence embeddings directly from Python. Previously, to get embeddings we had to call the C++ executable, which would:

After those steps we had to load the inferred embeddings from disk to use them in some program. This process generates too much I/O and forces us to reload the model whenever we want new embeddings.

To solve this problem, I wrote a Cython wrapper exposing the necessary functionalities to load models and infer embeddings:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")

Tokenization is not handled, so anyone using this must preprocess the text themselves.
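
As a rough illustration of the preprocessing left to the caller, here is a minimal sketch of a tokenizer that lowercases the text and separates punctuation with spaces. The helper name `simple_tokenize` and the regex are assumptions made for this example, not part of sent2vec; in practice the preprocessing should probably match whatever was applied to the model's training corpus.

```python
import re

# Hypothetical helper, not part of sent2vec: matches either a run of
# word characters or a single non-space, non-word character (punctuation).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def simple_tokenize(text):
    """Lowercase the text and return its tokens joined by single spaces."""
    return " ".join(_TOKEN_RE.findall(text.lower()))


if __name__ == "__main__":
    print(simple_tokenize("Once upon a time."))
```

The returned string could then be passed to `embed_sentence` as in the snippet above. Note that this crude rule also splits contractions such as "don't" into three tokens.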

To compile the module simply run python setup.py build_ext --inplace from the src folder.

martinjaggi commented 6 years ago

looks good to me!

BTW, doesn't anyone offer similar functionality for fasttext, since it has the same problem?

On Mon, Jan 15, 2018 at 2:48 PM mpagli <notifications@github.com> wrote:

@mpagli requested your review on epfml/sent2vec#17 (https://github.com/epfml/sent2vec/pull/17): Cython wrapper to keep model in memory and infer embeddings.


mpagli commented 6 years ago

Yes, there is a package wrapping the official fasttext functionality: https://pypi.python.org/pypi/fasttext (it is also implemented using Cython).

Domi-Zhang commented 6 years ago

Nice work! However, I found some problems:

  1. 'embed_sentence' cannot handle UTF-8 encoded sentences, and my corpus contains many Chinese characters.
  2. It would be better if the output vector were normalized before being returned.
  3. After running "python setup.py install .", I got a new package named "UNKNOWN-0.0.0", which is bad for package management.

mpagli commented 6 years ago

@Domi-Zhang: 1 and 3 should be fixed by https://github.com/epfml/sent2vec/commit/0e54ad8bfa43fa5da30463b3586607508ac00606. Concerning 2, I believe the default behavior should be to return unnormalized vectors. What I can do is modify the wrapper to let the user choose between normalized and unnormalized output.
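
To make the proposed option concrete, here is a minimal sketch of optional normalization applied to an embedding on the Python side. It assumes "normalized" means scaling to unit L2 norm, which the thread does not spell out; the names `l2_normalize` and `postprocess` and the `normalize` flag are hypothetical and not part of the wrapper.

```python
import math


def l2_normalize(vec):
    """Return a copy of vec scaled to unit L2 norm (assumed meaning of 'normalized')."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged to avoid dividing by zero.
        return list(vec)
    return [x / norm for x in vec]


def postprocess(vec, normalize=False):
    """Hypothetical switch: unnormalized by default, normalized on request."""
    return l2_normalize(vec) if normalize else list(vec)


if __name__ == "__main__":
    print(postprocess([3.0, 4.0]))
    print(postprocess([3.0, 4.0], normalize=True))
```

Keeping the default unnormalized leaves the raw vectors available, while callers who only need cosine similarity could opt in.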