bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

[Question] How can we use BPEmb for large documents? #54

Closed neel04 closed 3 years ago

neel04 commented 3 years ago

First off, thanks a lot to everyone for providing such a wonderful library, especially @bheinzerling!

I want to use BPE embeddings for large documents, i.e. texts containing many words. Before this, I was using Doc2Vec, since it seemed to be an integrated package for this.

However, with the BPE embeddings, I wanted to verify that this is what I should do:

  1. Obtain the pre-trained embeddings with the desired vocabulary size and dimension

  2. Use the BPEmb.embed method to generate a vector for each word

  3. Repeat step 2 for all the words in a document

  4. Finally, average the vectors and use the result as the document vector (see the sketch below)
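
In code, roughly something like this sketch (the English model, vocabulary size, and dimension are arbitrary example choices, and splitting on whitespace is just a placeholder for real tokenization):

```python
from bpemb import BPEmb
import numpy as np

# Step 1: load pre-trained subword embeddings
# (vs = vocabulary size, dim = embedding dimension; example values)
bpemb_en = BPEmb(lang="en", vs=25000, dim=100)

def doc_vector(doc_text):
    # Steps 2-3: embed() returns one vector per BPE subword;
    # mean-pool the subwords to get one vector per word
    word_vecs = [bpemb_en.embed(word).mean(axis=0)
                 for word in doc_text.split()]
    # Step 4: average the word vectors into a single document vector
    return np.mean(word_vecs, axis=0)
```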

Does that seem right, or is there some key feature of this repo that would let me simplify or improve a step? I am also open to any other suggestions for the process :+1:

Thank you for taking the time to read through my query! :hugs: And thanks in advance!!

bheinzerling commented 3 years ago

If I understand correctly what you want to do, you don't have to do step 2 for each word; you can just call the embed method on the entire document, something like:

```python
doc_emb = bpemb.embed(doc_text).mean(axis=0)
```

This will segment the document text into BPE subwords, look up the corresponding embeddings, and average them. I've never used BPEmb to create document embeddings, though, so I don't know how well this will work. Let me know if you get good results :)
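
Spelled out with model loading, a minimal sketch of this (the English model, vocabulary size, and dimension below are just example choices):

```python
from bpemb import BPEmb

# Load a pre-trained model (example: English, 25k vocab, 100 dims)
bpemb = BPEmb(lang="en", vs=25000, dim=100)

doc_text = "BPEmb can embed whole documents, not just single words."

# embed() segments the text into BPE subwords and returns an array
# of shape (num_subwords, dim); mean over axis 0 pools them into
# a single document vector
doc_emb = bpemb.embed(doc_text).mean(axis=0)
print(doc_emb.shape)  # (100,)
```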

neel04 commented 3 years ago

Thank you for your reply! Unfortunately, I had already implemented a class that takes the mean of the nested array, but I will definitely use your trick for the inference part.

I expect these embeddings will work better than ordinary word vectors, because subword embeddings with a small vocabulary increase the chance that a pre-trained vector is present (so far I have found no OOV errors). I will definitely update you with the results, to possibly help someone else out! :+1:
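
For reference, the no-OOV property is easy to check: any input string is segmented into subwords from the fixed BPE vocabulary, so an embedding lookup always succeeds. A quick sketch (same example model as above; the exact split depends on the model):

```python
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=25000, dim=100)

# Even a rare or made-up word is split into known subword units,
# so there is always an embedding to look up (no OOV)
print(bpemb_en.encode("supercalifragilistic"))       # list of subword strings
print(bpemb_en.embed("supercalifragilistic").shape)  # (num_subwords, 100)
```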

Again, thanks a lot for the quick reply!!!