dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

[Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm #6987

Closed ericstj closed 3 months ago

ericstj commented 4 months ago

The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.

Reference implementations and documentation are available at:
- https://github.com/microsoft/BlingFire (MIT license)
- https://github.com/google/sentencepiece (Apache license)
- https://github.com/huggingface/tokenizers (Apache license)
- https://huggingface.co/docs/transformers/main/en/model_doc/llama

Hugging Face also has Llama 2. It would be worth understanding whether that should be included now or designed for so it can be added later.

LLaMA Tokenizer: https://arxiv.org/abs/2203.13474 https://arxiv.org/pdf/2203.13474.pdf

SentencePiece: https://arxiv.org/abs/1808.06226 https://arxiv.org/pdf/1808.06226.pdf

ericstj commented 3 months ago

SentencePiece is structured into tokenizers, normalizers, and a model file. We only require a subset of this (the tokenizer).

We will port this subset and make it work with the data sources used by the SentencePiece BPE tokenizer.
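
For readers unfamiliar with the tokenizer subset in question, here is a rough sketch of SentencePiece-style BPE encoding. It is written in Python purely for illustration and is not the Microsoft.ML.Tokenizers implementation. The function name `bpe_encode` and the toy vocabulary are hypothetical. It assumes the usual SentencePiece conventions: spaces are replaced with the "▁" (U+2581) marker, a dummy prefix marker is added, and the adjacent pair with the highest vocabulary score is merged first. Byte fallback, unknown-token handling, and normalization are left out.

```python
# Illustrative sketch only; names and vocabulary below are hypothetical.
SPACE_MARKER = "\u2581"  # the visible whitespace symbol used by SentencePiece


def bpe_encode(text, vocab):
    """Greedy BPE encoding. `vocab` maps piece -> score (higher merges first)."""
    # Mark whitespace and add a dummy prefix, as SentencePiece does by default.
    normalized = SPACE_MARKER + text.replace(" ", SPACE_MARKER)
    pieces = list(normalized)  # start from single characters
    while True:
        best_score, best_index = None, -1
        # Find the adjacent pair whose concatenation has the highest score.
        for i in range(len(pieces) - 1):
            score = vocab.get(pieces[i] + pieces[i + 1])
            if score is not None and (best_score is None or score > best_score):
                best_score, best_index = score, i
        if best_index < 0:
            break  # no adjacent pair is in the vocabulary
        merged = pieces[best_index] + pieces[best_index + 1]
        pieces[best_index:best_index + 2] = [merged]
    return pieces


# Toy vocabulary (made up for this example); real models load scores
# from the SentencePiece model file.
toy_vocab = {
    SPACE_MARKER + "h": -1.0,
    "el": -2.0,
    SPACE_MARKER + "hel": -3.0,
    "lo": -4.0,
    SPACE_MARKER + "hello": -5.0,
}

print(bpe_encode("hello", toy_vocab))
```

A production port would likely replace the linear scan with a priority queue over candidate pairs, but the merge order is the same idea.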

ericstj commented 3 months ago

#7078