Closed: ericstj closed this issue 3 months ago
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.
Reference implementations:
- https://github.com/microsoft/BlingFire (MIT license)
- https://github.com/google/sentencepiece (Apache license)
- https://github.com/huggingface/tokenizers (Apache license)
- https://huggingface.co/docs/transformers/main/en/model_doc/llama
Hugging Face also has Llama 2 - it might be worth understanding whether that should be included now, or designed for so it can be added later.
LLaMA Tokenizer: https://arxiv.org/abs/2203.13474 (PDF: https://arxiv.org/pdf/2203.13474.pdf)
SentencePiece: https://arxiv.org/abs/1808.06226 (PDF: https://arxiv.org/pdf/1808.06226.pdf)
SentencePiece has a tokenizer/normalizer/model-file structure. We only require a subset of this (the tokenizer). We will port that subset and make it work with the data sources used by the SentencePiece BPE tokenizer.
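To make the scope concrete, the core of the tokenizer subset is the BPE merge loop: given a learned merge table (pairs ranked by training order), greedily apply the best-ranked merge to adjacent symbols until none applies. The sketch below is a minimal Python illustration of that idea with a toy, hypothetical merge table; it is not the SentencePiece implementation (which also handles details such as the `▁` whitespace marker and unknown pieces) nor the proposed C# API.

```python
def bpe_encode(word, merge_ranks):
    """Greedily apply the lowest-rank (highest-priority) merge until none applies.

    merge_ranks maps a symbol pair (a, b) to its rank; lower rank = learned
    earlier during training, so it is merged first.
    """
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(merge_ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, best_i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies to any adjacent pair
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

# Toy merge table for illustration only (not a real SentencePiece model).
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", ranks))  # ['low', 'er']
```

In the real model file these merges (and their scores) come from the trained SentencePiece model, which is the data source the ported tokenizer would need to read.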