dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars 1.86k forks

Support special tokens in sentence piece bpe #7158

Open LittleLittleCloud opened 1 month ago

LittleLittleCloud commented 1 month ago

Is your feature request related to a problem? Please describe.
Phi-3 uses the Llama 2 tokenizer with a few special tokens such as <|user|> and <|system|>. Currently there is no way to add special tokens to the SentencePiece BPE tokenizer (the Llama 2 tokenizer) in ML.NET.

Describe the solution you'd like
A clear and concise description of what you want to happen.
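To make the requested behavior concrete, here is a minimal sketch in Python of how special tokens are commonly layered on top of a BPE tokenizer: the input is split on the special-token strings first, each special token maps directly to a reserved ID, and only the remaining text goes through the normal encoder. This is not the ML.NET API. The function names, the `toy_base_encode` stand-in (one ID per character in place of a real SentencePiece BPE model), and the special-token IDs are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List


def encode_with_special_tokens(
    text: str,
    special_tokens: Dict[str, int],
    base_encode: Callable[[str], List[int]],
) -> List[int]:
    """Encode text, mapping special tokens to fixed IDs and the rest via base_encode."""
    if not special_tokens:
        return base_encode(text)

    # Try longer special tokens first so that a token which is a prefix of
    # another one cannot shadow it.
    alternatives = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"

    ids: List[int] = []
    # The capturing group makes re.split keep the special tokens in the result.
    for piece in re.split(pattern, text):
        if not piece:
            continue
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(base_encode(piece))
    return ids


def toy_base_encode(segment: str) -> List[int]:
    """Hypothetical stand-in for the real BPE encoder: one ID per character."""
    return [ord(ch) for ch in segment]


# Hypothetical IDs, placed above the Unicode code point range so they cannot
# collide with the toy encoder's output. A real model defines its own IDs.
SPECIAL_TOKENS = {"<|system|>": 2_000_000, "<|user|>": 2_000_001}

print(encode_with_special_tokens("<|user|>hi", SPECIAL_TOKENS, toy_base_encode))
# prints [2000001, 104, 105]
```

The point of the sketch is that without the pre-split step, a string like <|user|> is broken into ordinary sub-word pieces by the base tokenizer, which is the gap this issue asks to close.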

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.