If I understand what you want to do correctly, you don't have to do step 2 for each word; you can just call the embed method on the entire document, something like:
```python
doc_emb = bpemb.embed(doc_text).mean(axis=0)
```
This will segment the document text into BPE subwords, look up the corresponding embeddings, and average them. I've never used BPEmb to create document embeddings, though, so I don't know how well this will work. Let me know if you get good results :)
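For reference, here is a minimal end-to-end sketch of that idea, assuming the English model; the vocabulary size, dimension, and example text below are arbitrary placeholders, not values from this thread:

```python
# Minimal sketch of document embedding via subword averaging (assumed example values).
from bpemb import BPEmb

bpemb = BPEmb(lang="en", vs=50000, dim=300)

doc_text = "A longer document made up of several sentences and many words."

# embed() segments the text into BPE subwords and looks up one vector per subword,
# returning an array of shape (n_subwords, dim). Averaging over axis 0 collapses
# this into a single document vector of shape (dim,).
doc_emb = bpemb.embed(doc_text).mean(axis=0)
print(doc_emb.shape)  # -> (300,)
```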
Thank you for your reply! Unfortunately, I have already implemented a class that takes the mean of the nested array, but I will definitely use your trick in the inference part.
I am fairly confident they will turn out better than ordinary word vectors, because subword embeddings with a small vocabulary increase the chance that a pre-trained vector is present (so far I have found no OOV errors). I will definitely update you with the results, to possibly help someone else out! :+1:
Again, thanks a lot for the quick reply!!!
First off, thanks a lot to everyone for providing such a wonderful library, and especially to @bheinzerling!
I wanted to use BPE embeddings for large documents that contain multiple words. Before this, I was using Doc2Vec, since it seemed to be an integrated package. However, with BPE embeddings, I wanted to verify whether this is what I should do:
1. Obtain the pre-trained embedding with the corresponding vocabulary size and dimensions.
2. Use the `BPEmb.embed` method to generate a vector for each word.
3. Repeat step 2 for all the words in a document.
4. Finally, average the vectors and use that as the final document vector (see the sketch below).
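A rough sketch of what I mean by steps 2–4 is below; the whitespace split and the example model parameters are just stand-ins for whatever tokenization and pre-trained model one actually uses:

```python
# Rough sketch of the per-word pipeline (steps 2-4); values here are assumed examples.
from bpemb import BPEmb
import numpy as np

bpemb = BPEmb(lang="en", vs=50000, dim=300)  # step 1: a pre-trained model

doc_text = "an example document with several words"
words = doc_text.split()  # naive whitespace tokenization, for illustration only

# steps 2-3: one averaged subword vector per word
word_vecs = [bpemb.embed(w).mean(axis=0) for w in words]

# step 4: average the word vectors into a single document vector
doc_vec = np.mean(word_vecs, axis=0)
print(doc_vec.shape)  # -> (300,)
```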
Does that seem right, or is there some key feature of this repo that would let me improve one of the steps? I am also open to any other suggestions for the process :+1:
Thank you for taking the time to read through my query! :hugs: And thanks in advance!!