bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

How do you get the embedding/ID for the pad token? #25

Closed: derlin closed this issue 5 years ago

derlin commented 5 years ago

Hi, this may be a dummy question, but when creating a BPEmb with add_pad_emb=True, how do I actually get the padding embedding, and what is its ID? This should maybe appear somewhere in the docs:

from bpemb import BPEmb
bp = BPEmb(lang="en", vs=1000, dim=50, add_pad_emb=True)
print(bp.vs)  # prints 1000, was expecting 1000+1?

Thanks for the great work, derlin

derlin commented 5 years ago

Found it in the gensim doc:

bp.emb['<pad>']

It seems to (always?) be an array of zeros.
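A minimal check of this, reusing the setup from the snippet in the question above (the numpy import is only needed for the comparison):

import numpy as np
from bpemb import BPEmb

bp = BPEmb(lang="en", vs=1000, dim=50, add_pad_emb=True)
pad_vec = bp.emb['<pad>']         # lookup via the underlying gensim KeyedVectors
print(np.allclose(pad_vec, 0.0))  # True: the pad embedding is initialized with zeros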

bheinzerling commented 5 years ago

Hi, not a dummy question at all. I've added this to the docstring:

This embedding is initialized with zeros and appended to the end
of the embedding matrix. Assuming "bpemb" is a BPEmb instance, the
padding embedding can be looked up with "bpemb['<pad>']", or
directly accessed with "bpemb.vectors[-1]".
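
Putting the two access paths from the docstring together, a short sketch that also recovers the pad token's integer ID (assuming, as the docstring says, that the pad vector is appended as the last row of the embedding matrix, so its ID is that row's index):

import numpy as np
from bpemb import BPEmb

bp = BPEmb(lang="en", vs=1000, dim=50, add_pad_emb=True)

# Both access paths from the docstring point at the same appended row:
print(np.array_equal(bp['<pad>'], bp.vectors[-1]))  # True

# The row is appended after the vs regular subwords, so its ID is
# the last row index, i.e. 1000 here (bp.vs itself stays 1000):
pad_id = bp.vectors.shape[0] - 1
print(pad_id)  # 1000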

derlin commented 5 years ago

This is perfect, thanks!