bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Is there a way to specify the maximum number of subwords so that I can get an embedding of fixed size? #33

Closed: subrahmanyap closed this issue 4 years ago

subrahmanyap commented 4 years ago

For one of my applications, I need a fixed-length embedding for each word, but the current API does not offer an option to specify one. For example, when I encode the words "invoice" and "operator" with an embedding dimension of 50, "invoice" is split into 3 subwords and gets an array of size 3 × 50 = 150, while "operator" is not split and gets an array of size 50. Is there a way to get an array of constant size? Please see the output below for reference.

>>> bpemb_en.encode("invoice")
['▁inv', 'o', 'ice']
>>> bpemb_en.encode("operator")
['▁operator']

>>> np.size(bpemb_en.embed("operator"))
50
>>> np.size(bpemb_en.embed("invoice"))
150

bheinzerling commented 4 years ago

Mean pooling is probably the simplest way:

>>> emb = bpemb_en.embed("invoice")
>>> emb.shape
(3, 50)
>>> pooled_emb = emb.mean(axis=0)
>>> pooled_emb.shape
(50,)
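
If the downstream application needs the subword structure preserved rather than averaged away, another option is to pad or truncate the subword embedding matrix to a fixed number of rows and flatten it. The sketch below assumes an English BPEmb model with dim=50; the helper name embed_fixed, the max_subwords parameter, and the zero-pad/truncate policy are illustrative choices, not part of the BPEmb API.

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", dim=50)

def embed_fixed(word, max_subwords=3):
    # Subword embedding matrix of shape (n_subwords, dim).
    emb = bpemb_en.embed(word)
    # Truncate to at most max_subwords rows.
    emb = emb[:max_subwords]
    # Zero-pad to exactly max_subwords rows if the word has fewer subwords.
    pad_rows = max_subwords - emb.shape[0]
    if pad_rows > 0:
        emb = np.vstack([emb, np.zeros((pad_rows, emb.shape[1]))])
    # Flatten to a constant-size vector of length max_subwords * dim.
    return emb.flatten()

embed_fixed("invoice").shape   # (150,)
embed_fixed("operator").shape  # (150,)

Mean pooling keeps the output at the embedding dimension (50 here), while the padded variant grows with max_subwords; which fits better depends on whether the downstream model can exploit subword order.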