Closed dunovank closed 3 years ago
This is intended behaviour. `<pad>` isn't part of the subword segmentation model. Only tokens that are contained in the segmentation model's vocabulary (e.g. this one here) are left intact; all others are split into subwords.
`add_pad_emb=True` just adds a row to the embedding matrix, which is convenient in some use cases.
Adding special tokens requires training a new subword segmentation model, which can be done with this package.
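To make the distinction concrete, here is a minimal sketch (not the BPEmb/SentencePiece implementation; the toy vocabulary and greedy matcher below are illustrative assumptions) of why an out-of-vocabulary string like `<pad>` gets split, and why `add_pad_emb=True` only appends one row to the embedding matrix without changing segmentation:

```python
import numpy as np

# Hypothetical tiny subword vocabulary; real BPEmb vocabs are learned.
vocab = ["▁the", "▁to", "<", "pa", "d", ">", "▁"]

def segment(token, vocab):
    """Greedy longest-match segmentation, standing in for the real model."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            pieces.append(token[i])  # fall back to a single character
            i += 1
    return pieces

# "<pad>" is not a single vocabulary entry, so it is split into pieces:
print(segment("<pad>", vocab))  # ['<', 'pa', 'd', '>']

# Adding a pad embedding only appends one extra row to the matrix;
# the segmenter above is untouched, so "<pad>" is still split.
dim = 4
emb = np.random.rand(len(vocab), dim)           # original embeddings
emb_with_pad = np.vstack([emb, np.zeros(dim)])  # extra row for <pad>
print(emb_with_pad.shape)  # (8, 4)
```

The point of the sketch: the pad row exists in the embedding matrix, but nothing maps the literal string `<pad>` to it during tokenization.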
Got it, thanks for the speedy reply and clarification.
It seems that special tokens are not respected by BPEmb. For instance, `<pad>` gets parsed into multiple subword tokens instead of being caught and assigned the appropriate index. This is true even when instantiating with `add_pad_emb=True`:
Input 1:
Output 1:
Input 2:
Output 2: