Hi all,
Is support planned for Tokenizers for SentencePiece models using the Unigram algorithm?
If I've checked correctly, there's a Tokenizer implementation for SentencePiece models coming in the 4.0 release, but it seems to support only BPE models, in the form of LlamaTokenizer.
Specifically, I'd like to create a tokenizer for the Helsinki-NLP/opus-mt-xx-xx models from HuggingFace, but I'm having difficulty doing so.