Modalities / modalities

Modalities, a PyTorch-native framework for distributed and reproducible foundation model training.
MIT License

Add support for SentencePiece Tokenizers to add special tokens #222

Open lllAlexanderlll opened 3 months ago

lllAlexanderlll commented 3 months ago

Feature request

Add support for adding special tokens to SentencePiece tokenizers, similar to the HuggingFace add_special_tokens method. As a first step that does not require resizing the embedding matrix, allow tokens already in the vocabulary to be marked as special tokens, so that they are always matched before non-special tokens during tokenization.
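To make the proposed first step concrete, here is a minimal sketch of one way it could work: split the input on the special tokens first, map each special token directly to its existing vocabulary id, and pass only the remaining text to the underlying tokenizer. The class name `SpecialTokenWrapper` and its constructor arguments are hypothetical and not part of Modalities or SentencePiece; the base encoder is passed in as a plain callable so the sketch stays independent of any particular tokenizer.

```python
import re
from typing import Callable, Dict, List


class SpecialTokenWrapper:
    """Hypothetical helper: matches special tokens before the base tokenizer runs."""

    def __init__(
        self,
        encode: Callable[[str], List[int]],
        special_token_ids: Dict[str, int],
    ) -> None:
        if not special_token_ids:
            raise ValueError("at least one special token is required")
        self._encode = encode
        self._special_token_ids = dict(special_token_ids)
        # Try longer tokens first so that overlapping special tokens
        # resolve to the longest match.
        ordered = sorted(self._special_token_ids, key=len, reverse=True)
        # The capturing group makes re.split keep the matched special tokens.
        self._pattern = re.compile(
            "(" + "|".join(re.escape(token) for token in ordered) + ")"
        )

    def encode(self, text: str) -> List[int]:
        ids: List[int] = []
        for segment in self._pattern.split(text):
            if not segment:
                continue  # re.split yields empty strings around adjacent matches
            if segment in self._special_token_ids:
                # Special token: use its existing vocabulary id directly.
                ids.append(self._special_token_ids[segment])
            else:
                # Ordinary text: delegate to the base tokenizer.
                ids.extend(self._encode(segment))
        return ids
```

With SentencePiece, the callable could be the `encode` method of a loaded `SentencePieceProcessor`, and the ids could be looked up with `piece_to_id`, assuming the special tokens are already pieces in the vocabulary. One caveat of this approach: because each text segment is encoded separately, tokenizer settings such as the dummy whitespace prefix may produce ids that differ from encoding the whole string at once.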

Motivation

This feature is needed if we pre-train a model with a SentencePiece tokenizer and later want to, e.g., instruction-tune it. Without it, we must first convert the model and tokenizer to a HuggingFace model and tokenizer.