bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

special tokens not handled #51

Closed dunovank closed 3 years ago

dunovank commented 3 years ago

It seems that special tokens are not respected by BPEmb. For instance, "<pad>" gets parsed into multiple subword tokens instead of being caught and assigned the appropriate index. This is true even when instantiating BPEmb with add_pad_emb=True.

Input 1

from bpemb import BPEmb

bp = BPEmb(lang="en", dim=50, add_pad_emb=True)
text = '<pad> this is an example of a text <pad> sequence <pad>'
bp.encode(text)

Output 1:

['▁', '<', 'p', 'ad', '>', '▁this', '▁is', '▁an', '▁example', '▁of', '▁a', '▁text', '▁', '<', 'p', 'ad', '>', '▁sequence', '▁', '<', 'p', 'ad', '>']

Input 2

bp.encode_ids(text)

Output 2:

[9912, 0, 9929, 63, 0, 215, 80, 32, 1462, 27, 4, 2748, 9912, 0, 9929, 63, 0, 5304, 9912, 0, 9929, 63, 0]

bheinzerling commented 3 years ago

This is intended behaviour. <pad> isn't part of the subword segmentation model's vocabulary. Only tokens that are contained in the segmentation model's vocabulary (e.g. this one here) are kept whole; all others are split into subwords. add_pad_emb=True just adds a row to the embedding matrix, which is convenient in some use cases, but it does not change how text is segmented. Adding special tokens requires training a new subword segmentation model, which can be done with this package.
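
A possible workaround, sketched here rather than taken from BPEmb's API: since add_pad_emb=True appends a zero-initialized pad vector as the last row of the embedding matrix, the raw text can be split on the literal "<pad>" marker and the marker mapped to that last index manually.

from bpemb import BPEmb

bp = BPEmb(lang="en", dim=50, add_pad_emb=True)

# With add_pad_emb=True the embedding matrix has vocab_size + 1 rows;
# the pad vector is the last, zero-initialized row.
PAD_ID = bp.vectors.shape[0] - 1

def encode_with_pad(text, pad_token="<pad>"):
    # Encode everything between pad markers normally and emit PAD_ID
    # for each marker instead of letting it be split into subwords.
    ids = []
    for chunk in text.split(pad_token):
        chunk = chunk.strip()
        if chunk:
            ids.extend(bp.encode_ids(chunk))
        ids.append(PAD_ID)
    return ids[:-1]  # drop the PAD_ID appended after the final chunk

encode_with_pad('<pad> this is an example of a text <pad> sequence <pad>')

If the special token has to survive segmentation itself, retraining the segmentation model is the way to go, as noted above. Assuming the package referenced there is SentencePiece (which BPEmb's segmentation models are built with), its user_defined_symbols training option keeps "<pad>" as a single piece; the corpus path and vocabulary size below are placeholders.

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="my_corpus.txt", model_prefix="my_bpe", model_type="bpe",
    vocab_size=10000, user_defined_symbols=["<pad>"])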

dunovank commented 3 years ago

Got it, thanks for the speedy reply and clarification.