Closed mpagli closed 6 years ago
looks good to me!
BTW, does anyone offer similar functionality for fastText, since it has the same problem?
On Mon, Jan 15, 2018 at 2:48 PM mpagli notifications@github.com wrote:

> @mpagli (https://github.com/mpagli) requested your review on epfml/sent2vec#17 (https://github.com/epfml/sent2vec/pull/17): Cython wrapper to keep model in memory and infer embeddings.
Yes, there is a package wrapping the official fastText functionality: https://pypi.python.org/pypi/fasttext. It is also implemented using Cython.
Nice work! However, I found some problems:
@Domi-Zhang: 1 and 3 should be fixed by https://github.com/epfml/sent2vec/commit/0e54ad8bfa43fa5da30463b3586607508ac00606. Concerning 2, I believe the default behavior should be to return unnormalized vectors. What I can do is modify the wrapper to let the user choose between normalized and unnormalized vectors.
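For illustration, the difference between the two options amounts to L2-normalizing the returned embedding. A minimal sketch (independent of the wrapper's actual code, which operates on the C++ side):

```python
import math

def l2_normalize(vec):
    """Return an L2-normalized copy of a vector; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

raw = [3.0, 4.0]          # unnormalized embedding as returned by default
unit = l2_normalize(raw)  # [0.6, 0.8], which has unit norm
```

Returning unnormalized vectors by default is the safer choice, since callers can always normalize afterwards but cannot recover the original magnitudes.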
This PR adds the ability to keep a model in memory and get sentence embeddings directly from Python. Previously, to get embeddings, we had to call the C++ executable, which would:
After those steps, we had to load the inferred embeddings from disk to use them in a program. This process generates too many I/O operations and forces us to reload the model whenever we want new embeddings.
To solve this problem, I wrote a Cython wrapper exposing the necessary functionalities to load models and infer embeddings:
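The pattern this enables can be sketched in pure Python; the class and method names below are illustrative only, not the wrapper's actual API. The model is loaded once and then queried many times, with no per-call process launch or disk round-trip:

```python
class ToyEmbeddingModel:
    """Illustrative stand-in for a model held in memory: word vectors are
    loaded once, then sentence embeddings are computed on demand by
    averaging word vectors (no reload and no disk I/O per query)."""

    def __init__(self, word_vectors):
        # word_vectors: dict mapping token -> list of floats, all same length
        self.word_vectors = word_vectors
        self.dim = len(next(iter(word_vectors.values())))

    def embed_sentence(self, tokens):
        # average the vectors of known tokens; unknown tokens are skipped
        vecs = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
        if not vecs:
            return [0.0] * self.dim
        return [sum(col) / len(vecs) for col in zip(*vecs)]

# load once...
model = ToyEmbeddingModel({"hello": [1.0, 0.0], "world": [0.0, 1.0]})
# ...then embed as many sentences as needed
emb = model.embed_sentence(["hello", "world"])  # [0.5, 0.5]
```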
Tokenization is not handled, so anyone using this should do the preprocessing themselves.
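As an example of the kind of preprocessing a caller might apply before querying the model (a sketch only; it should be made consistent with whatever tokenization was used at training time):

```python
import re

def simple_preprocess(sentence):
    """Lowercase, separate common punctuation, and split on whitespace."""
    sentence = sentence.lower()
    # put spaces around punctuation so each mark becomes its own token
    sentence = re.sub(r"([.,!?;:])", r" \1 ", sentence)
    return sentence.split()

simple_preprocess("Hello, World!")  # ['hello', ',', 'world', '!']
```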
To compile the module, simply run

```
python setup.py build_ext --inplace
```

from the `src` folder.