Tokenizer for SentencePiece Unigram models

dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.

https://dot.net/ml

MIT License

Tokenizer for SentencePiece Unigram models #7294

Closed: KasperNissen1997 closed this issue 4 days ago

KasperNissen1997 commented 1 week ago

Hi all, is support planned for tokenizers for SentencePiece models that use the Unigram algorithm? If I've checked correctly, there is a tokenizer implementation for SentencePiece models coming in the 4.0 release, but it seems to support only BPE models, in the form of LlamaTokenizer.

Specifically, I'd like to create a tokenizer for the Helsinki-NLP/opus-mt-xx-xx models from Hugging Face, but I am having difficulty doing so because those models use Unigram rather than BPE.
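
For context, here is a minimal sketch of what Unigram segmentation involves, as opposed to BPE's merge rules. A Unigram model assigns a log-probability to each piece and picks the segmentation with the highest total score, which is typically done with a Viterbi search. Everything below is an illustrative assumption: the function name `unigram_segment`, the toy vocabulary, its scores, and the unknown-character penalty are all made up, and this is not the ML.NET or SentencePiece API. A real SentencePiece model also handles normalization and other details omitted here.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (values are invented).
TOY_LOG_PROBS = {
    "▁token": -3.0,
    "▁to": -3.2,
    "ken": -4.5,
    "izer": -3.5,
    "s": -4.0,
    "i": -5.0,
    "z": -6.0,
    "e": -5.0,
    "r": -5.0,
}


def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` (Viterbi search).

    Assumes `log_probs` is non-empty. A single character that is not in the
    vocabulary is kept as its own piece with the penalty `unk_log_prob`.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    # best[i] is the best total score of any segmentation of text[:i].
    best = [-math.inf] * (n + 1)
    # back[i] is the start index of the last piece in that best segmentation.
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("▁tokenizers", TOY_LOG_PROBS))
# ['▁token', 'izer', 's']
```

With these toy scores, `▁token` + `izer` + `s` (total -10.5) beats `▁to` + `ken` + `izer` + `s` (total -15.2). The search needs the per-piece scores stored in the model file, not just a merge table, which is why a BPE-only tokenizer cannot load these models.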

ericstj commented 4 days ago

It's in our backlog. This is a duplicate of https://github.com/dotnet/machinelearning/issues/7186