ericstj commented 4 months ago

The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.

Hugging Face also has Llama2. It might be worth understanding whether that should be included as well, or at least designed for later inclusion.

ericstj commented 3 months ago

SentencePiece has a tokenizers/normalizers/model-file structure. We only require a subset of this (Tokenizer).

We will port this subset and make it work with the data sources used by the SentencePiece BPE tokenizer.
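
For readers unfamiliar with the tokenizer step, here is a minimal Python sketch of SentencePiece-style BPE encoding. It is illustrative only: it is not the Microsoft.ML.Tokenizers API and not the planned port. The names `normalize`, `bpe_encode`, and `TOY_SCORES` are hypothetical, and the toy vocabulary is invented. It assumes a much-simplified normalizer (spaces replaced by the "▁" marker, with one marker prepended) and a greedy loop that merges the adjacent pair whose merged piece has the highest score; a real model reads its pieces and scores from the model file.

```python
from typing import Dict, List

WS = "\u2581"  # the "▁" marker SentencePiece uses in place of spaces


def normalize(text: str) -> str:
    """Simplified normalizer (assumption): prepend a marker, replace spaces."""
    return WS + text.replace(" ", WS)


def bpe_encode(text: str, scores: Dict[str, float]) -> List[str]:
    """Greedy BPE: repeatedly merge the adjacent pair with the best score."""
    pieces = list(normalize(text))  # start from single characters
    while True:
        best_i, best_score = -1, float("-inf")
        for i in range(len(pieces) - 1):
            score = scores.get(pieces[i] + pieces[i + 1])
            # Only pairs whose merged form is a known piece can be merged.
            if score is not None and score > best_score:
                best_i, best_score = i, score
        if best_i < 0:
            return pieces  # no mergeable pair is left
        pieces[best_i:best_i + 2] = [pieces[best_i] + pieces[best_i + 1]]


# Invented toy vocabulary: piece -> score (higher scores merge first).
TOY_SCORES = {
    WS + "h": -1.0,
    "el": -2.0,
    "lo": -3.0,
    WS + "hel": -4.0,
    WS + "hello": -5.0,
}

if __name__ == "__main__":
    print(bpe_encode("hello lo", TOY_SCORES))
```

Because the merges only join neighbouring pieces, concatenating the output always reproduces the normalized input, which is what makes decoding back to text straightforward.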

ericstj commented 3 months ago

dotnet/machinelearning#7078