dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Implement Sentencepiece Unigram tokenizer #7186

Open arthurvb opened 2 months ago

arthurvb commented 2 months ago

Is your feature request related to a problem? Please describe. I want to use a multilingual model from Hugging Face (https://huggingface.co/intfloat/multilingual-e5-large). Its tokenizer is a SentencePiece Unigram tokenizer, so I am unable to port the model to C#/ONNX.

Describe the solution you'd like: Support for the SentencePiece Unigram tokenizer in the Microsoft.ML.Tokenizers package.

Describe alternatives you've considered: BlingFire, but it no longer seems to be maintained, and it is unclear whether it would return exactly the same token IDs.

Thank you for your time and effort (the library in general is great!)

ericstj commented 1 month ago

@tarekgh do any of our existing tokenizers support this, or is this new work?

tarekgh commented 1 month ago

> do any of our existing tokenizers support this, or is this new work?

This is a new model that needs to be implemented.
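
For context on what the new model would involve: a Unigram tokenizer assigns each vocabulary piece a log-probability and encodes text by picking the segmentation with the highest total score, typically with a Viterbi search. The following is only a minimal Python sketch of that search. It is not the SentencePiece implementation and not the Microsoft.ML.Tokenizers API; the function name `unigram_tokenize`, the `TOY_VOCAB` scores, and the `unk_score` fallback are all made up for illustration, and normalization and token-ID mapping are omitted.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (made-up values).
# "▁" marks a word boundary, as in SentencePiece-style vocabularies.
TOY_VOCAB = {
    "▁": -3.0,
    "▁hello": -2.0,
    "▁hel": -4.0,
    "lo": -4.0,
    "▁world": -2.5,
    "▁wor": -4.0,
    "ld": -4.5,
}


def unigram_tokenize(text, vocab, unk_score=-20.0):
    """Return the highest-scoring segmentation of `text` under `vocab`.

    Characters not covered by any piece fall back to single-character
    pieces scored `unk_score` (an assumption of this sketch).
    """
    n = len(text)
    max_len = max(map(len, vocab), default=1)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                score = unk_score  # unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


print(unigram_tokenize("▁hello▁world", TOY_VOCAB))
```

With the toy scores above, whole-word pieces outscore the split alternatives such as "▁hel" + "lo". A real port would also need to reproduce the model's normalization rules and vocabulary IDs in order to return exactly the same token IDs as the reference tokenizer.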