epfml / sent2vec

General purpose unsupervised sentence representations

Possibility to keep the model loaded in memory? #16

Closed · Tiriar closed this issue 6 years ago

Tiriar commented 6 years ago

Hello,

I am working on a QA system and would like to use your embeddings on the sentences the user gives me. The problem is that every time I want to transform a new sentence, sent2vec loads the whole model into memory before creating the embedding, and it repeats this for each subsequent sentence. Since the model file is cached, the repeated loading of the vocabulary is fast after the first call, but this still does not seem like the right approach for a system that has to run for a long period of time. Is it possible to load the model into RAM once and keep it there, sending it individual sentences to transform? (I am using the Python wrapper you provide as an example in the repo.)

One other question (not worth a separate issue, I believe): can you tell me the vocabulary size of the pre-trained models you provide? Unless I misunderstood the numbers in the paper, it only reports the number of words in the training corpora, and since I am using the Python wrapper, it is hard to work out the vocabulary sizes from the binary files.

Thank you :)

mpagli commented 6 years ago

The vocabularies used are quite large; for the Twitter models the size is probably around 1 million words, and the Toronto and Wiki ones are smaller. We found that infrequent tokens do not contribute much to the final accuracy of the representations, so we could probably have trained smaller models with similar results.

Concerning your problem of keeping the model in memory: we agree this is an important limitation, and I'm investigating ways to solve it. In the meantime you can amortize the loading time by sending large batches (say, 500k sentences or more) to be inferred at once.
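
To illustrate, batched inference can look roughly like this, going through the `print-sentence-vectors` mode of the fasttext binary described in the README. This is only a sketch: the binary and model paths are placeholders, and the exact output format may vary between builds.

```python
import subprocess

def embed_batch(sentences, model_path='wiki_unigrams.bin',
                fasttext_bin='./fasttext'):
    """Embed many sentences in one process invocation, so the model is
    read from disk once per batch instead of once per sentence."""
    out = subprocess.run(
        [fasttext_bin, 'print-sentence-vectors', model_path],
        input='\n'.join(sentences) + '\n',
        capture_output=True, text=True, check=True)
    # One output line per input sentence; parse according to the exact
    # format your binary prints.
    return out.stdout.splitlines()
```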

Tiriar commented 6 years ago

Thank you for your reply. I am currently using the model with larger batches (only for testing), but the final system will have to work in real time. I am trying multiple embedding algorithms, and if it weren't for the model loading, sent2vec would currently be the best candidate (great job btw, thank you :+1:). I am in no rush; I am building the system for my diploma thesis and still have a few months, so in the worst case I will fall back to a worse-performing or slower algorithm.

On another note, since the pre-trained models include so many rare words, would it also be possible to load only, say, the K most frequent words from the vocabulary? If the words in the pre-trained models are sorted by frequency, I believe it should not be hard to implement. It would be interesting to see how much it affects the quality of the embeddings, and it could lower the memory requirements quite a bit.

Thanks again :)

Domi-Zhang commented 6 years ago

Have you found a solution? I also need real-time queries.

mpagli commented 6 years ago

Check https://github.com/epfml/sent2vec/pull/17

You can install the module globally on your system by running (from the src folder):

```
python setup.py build_ext
sudo pip install .
```

I'll improve the wrapper in the coming weeks; you might hit some bugs, so let me know.
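
Once installed, usage looks roughly like this. The names below (`Sent2vecModel`, `load_model`, `embed_sentence`) follow the repo's Python wrapper, and the model path is a placeholder; if the PR changes the interface, adjust accordingly.

```python
import sent2vec

# Load the model once at startup; it then stays resident in RAM.
model = sent2vec.Sent2vecModel()
model.load_model('wiki_unigrams.bin')  # placeholder path to a pre-trained model

# Subsequent queries reuse the loaded model, with no reload per call.
emb = model.embed_sentence('once upon a time .')
embs = model.embed_sentences(['first sentence .', 'another sentence .'])
```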

@Tiriar: good to see that you found our models useful :)! Concerning your question about loading only the first K words of the vocabulary: this should be possible with some refactoring of the code; you would need to truncate the embedding matrices. I need to think more about how feasible this is. I'm also interested in lowering the memory requirements of the trained models.
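
As a rough sketch of what that truncation could look like (hypothetical names, not from the sent2vec codebase, assuming the vocabulary is stored most-frequent-first as in fastText-style models):

```python
import numpy as np

def truncate_vocab(embedding_matrix, vocab, k):
    """Keep only the rows for the k most frequent words, assuming the
    vocabulary is ordered most-frequent-first."""
    return embedding_matrix[:k], vocab[:k]

# Toy sizes for illustration; a real matrix of ~1M words x 700 dims is
# ~2.8 GB in float32, so keeping the top 200k words would cut it to ~0.56 GB.
emb = np.random.rand(1000, 700).astype(np.float32)
vocab = ['word%d' % i for i in range(1000)]
emb_small, vocab_small = truncate_vocab(emb, vocab, 200)
```

Note this only covers the word-embedding matrix; models trained with n-grams also hold a hashed bucket matrix that slicing the word rows would not shrink.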

Tiriar commented 6 years ago

I tried the wrapper today - works like a charm :-). Thank you :+1: