bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Can I add <pad>? #52

Closed Randool closed 3 years ago

Randool commented 3 years ago

Thanks for providing this easy-to-use tool.

I want to use bpemb in my project. To process sentences of different lengths in one batch, I have to append a padding token <pad> to the shorter sentences. But I find this impossible, because bpemb will tear down the `

The only approach I can think of is to force <pad> in through a fairly complex workaround, which is a bit painful. Is there a more flexible method?

By the way, I noticed that the BPE token with id 0 seldom occurs in the processed ids. Could I use it as the padding token?

Randool commented 3 years ago

Oh, I found the add_pad_emb option. The problem is solved.
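
For anyone landing here later, a minimal sketch of the batching step follows. It is plain Python and does not import bpemb. The id lists are made-up stand-ins for what encoding a few sentences might return, and `pad_batch` is a hypothetical helper, not part of the library. It assumes that with add_pad_emb enabled the padding embedding is the extra last row of the embedding matrix, so the pad id equals the original vocabulary size. Please verify this against the bpemb README before relying on it.

```python
from typing import List, Sequence


def pad_batch(batch: Sequence[Sequence[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in batch), default=0)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in batch]


# Assumed vocabulary size of the loaded model (hypothetical value).
vocab_size = 10000

# Assumption: add_pad_emb appends one extra embedding row at the end,
# so the pad id is the original vocabulary size.
pad_id = vocab_size

# Made-up subword ids standing in for three encoded sentences.
batch = [[12, 7, 431], [88], [5, 9]]

padded = pad_batch(batch, pad_id)
print(padded)  # → [[12, 7, 431], [88, 10000, 10000], [5, 9, 10000]]
```

Using a dedicated pad id rather than reusing id 0 keeps padding distinguishable from real subwords, which also makes it straightforward to build an attention or loss mask with a simple `id != pad_id` comparison.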