Add support for SentencePiece Tokenizers to add special tokens

Feature request

Add support for SentencePiece Tokenizers to add special tokens similar to the HuggingFace add_special_tokens method. A first step without also implementing resizing the embedding matrix is to allow marking tokens already known to the vocabulary as special tokens so that they are always tokenized before non-special tokens.

Motivation

This feature is needed if we pre-train a model based on the sentence-piece tokenizer and want to, e.g., instruction-tune it. Alternatively, we must first convert the model and tokenizer to a HuggingFace model and tokenizer.

Modalities / modalities

Add support for SentencePiece Tokenizers to add special tokens #222

Feature request

Motivation