dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Support special tokens in sentence piece bpe #7158

Closed (LittleLittleCloud closed this issue 4 months ago)

LittleLittleCloud commented 6 months ago

Is your feature request related to a problem? Please describe.
Phi-3 uses the Llama 2 tokenizer with a few additional special tokens, such as <|user|> and <|system|>. Currently, ML.NET provides no way to add special tokens to the SentencePiece BPE tokenizer (the Llama 2 tokenizer).
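
To make the request concrete, here is a minimal Python sketch of one common way to support special tokens: split the input on the special-token strings, map each one to a reserved id, and send everything else through the base tokenizer. This is not the ML.NET API. The function names, the `toy_encode` stand-in for a SentencePiece BPE encoder, and the ids are all hypothetical.

```python
import re
from typing import Callable, Dict, List


def encode_with_special_tokens(
    text: str,
    special_tokens: Dict[str, int],
    base_encode: Callable[[str], List[int]],
) -> List[int]:
    """Encode text, mapping each special token string to its reserved id."""
    if not special_tokens:
        return base_encode(text)
    # Try longer markers first so overlapping markers match greedily.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    ids: List[int] = []
    # The capturing group makes re.split keep the special tokens in the result.
    for piece in pattern.split(text):
        if not piece:
            continue
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(base_encode(piece))
    return ids


def toy_encode(s: str) -> List[int]:
    """Hypothetical stand-in for a real BPE encoder: one id per character."""
    return [ord(c) for c in s]


# Made-up ids, chosen only so they cannot collide with toy_encode output.
SPECIAL = {"<|user|>": 2_000_001, "<|system|>": 2_000_002}

print(encode_with_special_tokens("<|user|>hi", SPECIAL, toy_encode))
```

Without the pre-split step, the base tokenizer would break <|user|> into ordinary sub-word pieces, which is the behavior this issue asks to avoid.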


luisquintanilla commented 4 months ago

Is this something that could be more widely applicable beyond SentencePiece?

For example, see the template processing section of the Hugging Face tokenizers documentation: https://huggingface.co/docs/tokenizers/pipeline#all-together-a-bert-tokenizer-from-scratch
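
As a rough illustration of what template processing means in that documentation, the sketch below wraps an already-encoded sequence in special tokens according to a template string such as `[CLS] $A [SEP]`. It is a simplified, hypothetical re-implementation in plain Python, not the Hugging Face or ML.NET API, and the ids are illustrative.

```python
from typing import Dict, List


def apply_template(
    template: str,
    ids_a: List[int],
    special_tokens: Dict[str, int],
) -> List[int]:
    """Expand a whitespace-separated template into a list of token ids.

    "$A" is replaced by the encoded sequence; every other part must be a
    special token listed in special_tokens.
    """
    out: List[int] = []
    for part in template.split():
        if part == "$A":
            out.extend(ids_a)
        else:
            out.append(special_tokens[part])
    return out


# Illustrative ids only.
print(apply_template("[CLS] $A [SEP]", [7, 8], {"[CLS]": 1, "[SEP]": 2}))
```

Because the step operates on token ids rather than on the model's vocabulary, the same post-processing could in principle sit on top of any tokenizer, not only SentencePiece.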