bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Is there a way to specify the maximum number of subwords so that I can get an embedding of fixed size? #33

Closed: subrahmanyap closed this issue 4 years ago

subrahmanyap commented 4 years ago

For one of my applications, I need a fixed-length embedding for each word, but the current API does not offer an option to specify one. For example, when I encode the words "invoice" and "operator" with an embedding dimension of 50, "invoice" is split into 3 subwords and gets an array of size 3 × 50 = 150, while "operator" is not split and gets an array of size 50. Is there a way to get an array of constant size? Please see the output below for reference.

>>> bpemb_en.encode("invoice")
['▁inv', 'o', 'ice']
>>> bpemb_en.encode("operator")
['▁operator']

>>> np.size(bpemb_en.embed("operator"))
50
>>> np.size(bpemb_en.embed("invoice"))
150

bheinzerling commented 4 years ago

Mean pooling is probably the simplest way:

>>> emb = bpemb_en.embed("invoice")
>>> emb.shape
(3, 50)
>>> pooled_emb = emb.mean(axis=0)
>>> pooled_emb.shape
(50,)
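
If the downstream application needs the subword structure preserved rather than averaged away, another option is to pad or truncate the subword embedding matrix to a fixed number of rows and flatten it. The sketch below assumes an English BPEmb model with dim=50; the helper name embed_fixed, the max_subwords parameter, and the zero-pad/truncate policy are illustrative choices, not part of the BPEmb API.

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", dim=50)

def embed_fixed(word, max_subwords=3):
    # Subword embedding matrix of shape (n_subwords, dim).
    emb = bpemb_en.embed(word)
    # Truncate to at most max_subwords rows.
    emb = emb[:max_subwords]
    # Zero-pad to exactly max_subwords rows if the word has fewer subwords.
    pad_rows = max_subwords - emb.shape[0]
    if pad_rows > 0:
        emb = np.vstack([emb, np.zeros((pad_rows, emb.shape[1]))])
    # Flatten to a constant-size vector of length max_subwords * dim.
    return emb.flatten()

embed_fixed("invoice").shape   # (150,)
embed_fixed("operator").shape  # (150,)

Mean pooling keeps the output at the embedding dimension (50 here), while the padded variant grows with max_subwords; which fits better depends on whether the downstream model can exploit subword order.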