dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

[Tokenizers] How to load HF based tokenizers e.g. SmolLM #7197

Open nietras opened 4 months ago

nietras commented 4 months ago

LLM models are often distributed on HuggingFace or similar sites, where the tokenizers are presumed to have been created via the transformers library. Such a distribution usually contains a bunch of json/txt files. I have found it hard to know how to create an ML.Tokenizer from those files. For example, how would one create a tokenizer for:

https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct/tree/main

Could there be a getting-started document detailing how to load tokenizers from such files, and how to identify what to use to load them?