Often LLM models are distributed on HuggingFace or similar hubs, where the tokenizer is presumed to have been created via the `transformers` library. Such a repo often contains a bunch of JSON/TXT files, and I have found it hard to know how to create an ML.Tokenizer from them. For example, how would one create a tokenizer for:
https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct/tree/main
Could there be a getting-started document detailing how to load tokenizers from such files, and how to identify which loader to use for each?
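As a starting point for such a document, here is a rough sketch (in Python, since these files originate from the Python `transformers` ecosystem) of how one might inspect a downloaded repo to identify the tokenizer type: `tokenizer_config.json` normally carries a `tokenizer_class` field naming the implementation the model expects. The file contents below are invented for illustration and do not necessarily match the SmolLM repo:

```python
import json

# Files commonly found in a HuggingFace tokenizer repo:
#   tokenizer.json          - full fast-tokenizer definition (vocab, merges, normalizer, ...)
#   tokenizer_config.json   - metadata, including the "tokenizer_class" field
#   special_tokens_map.json - special tokens (bos/eos/pad/unk)
#
# Simulate a tokenizer_config.json with hypothetical contents so this
# sketch is self-contained; a real repo's values may differ.
sample_config = {
    "tokenizer_class": "GPT2Tokenizer",  # hypothetical value for illustration
    "bos_token": "<|im_start|>",
    "eos_token": "<|im_end|>",
}
with open("tokenizer_config.json", "w") as f:
    json.dump(sample_config, f)

def identify_tokenizer(path: str) -> str:
    """Return the tokenizer class name declared in tokenizer_config.json."""
    with open(path) as f:
        return json.load(f)["tokenizer_class"]

print(identify_tokenizer("tokenizer_config.json"))
```

A getting-started guide could map each `tokenizer_class` value (and the presence or absence of files like `tokenizer.json` vs. `vocab.json` + `merges.txt`) to the corresponding ML.Tokenizer construction call.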