Using pre-trained Hugging Face tokenizers

dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.

https://dot.net/ml

MIT License

Using pre-trained Hugging Face tokenizers #7286

Closed: mehdihadeli closed this issue 1 week ago

mehdihadeli commented 2 weeks ago

Hi, is it possible to use pre-trained Hugging Face tokenizers with the ML.NET tokenizers, like this?

Tokenizer.from_pretrained("Xenova/llama-3-tokenizer")

luisquintanilla commented 2 weeks ago

Hi,

Not at this time. However, you can use the vocab / config files provided in the Hugging Face model repos to create your tokenizers (if that tokenizer is supported by Microsoft.ML.Tokenizers).

For example (e5-small model which uses a BERT Tokenizer): https://huggingface.co/intfloat/e5-small-v2/resolve/main/vocab.txt

https://github.com/luisquintanilla/RAGDotnetAIFundamentals/blob/4a5e0b5142b3d20b8bdd538cb90a311a3112f230/Program.cs#L9

var tokenizer = BertTokenizer.Create(Path.Join("assets", "vocab.txt"));

mehdihadeli commented 2 weeks ago

Hi, thanks for your response. Unfortunately, most models, like Xenova/llama3-tokenizer, don't have a vocab file. Could we pass tokenizer.json to the BertTokenizer.Create() method instead?

LittleLittleCloud commented 1 week ago

@mehdihadeli You can extract the vocab (vocab.json) and the merges (merge.txt) from tokenizer.json.

The vocab is stored under the $root.model.vocab path, and it's a Dictionary<string, int> (token to ID).

The merges are stored under $root.model.merges, and they're a string array.
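
The extraction described above can be sketched in a few lines of Python, using only the standard library. This is a minimal sketch, not an official ML.NET or Hugging Face utility: the function names `extract_vocab_and_merges` and `write_bpe_files` and the output file names are assumptions made for illustration, and the handling of merges stored as two-item lists (rather than strings) is an assumption about some tokenizer.json exports, not something stated in this thread. Whether the resulting files load in Microsoft.ML.Tokenizers still depends on that tokenizer type being supported.

```python
import json
from pathlib import Path


def extract_vocab_and_merges(tokenizer_config: dict) -> tuple:
    """Return (vocab, merges) from an already-parsed tokenizer.json."""
    model = tokenizer_config["model"]
    vocab = model["vocab"]  # maps token string -> token id
    merges = []
    for merge in model.get("merges", []):
        # The thread describes merges as strings such as "a b".
        # Assumption: some files store pairs like ["a", "b"] instead,
        # so those are joined with a space.
        merges.append(merge if isinstance(merge, str) else " ".join(merge))
    return vocab, merges


def write_bpe_files(tokenizer_json_path: str, out_dir: str) -> None:
    """Write vocab.json and merges.txt (hypothetical names) into out_dir."""
    text = Path(tokenizer_json_path).read_text(encoding="utf-8")
    vocab, merges = extract_vocab_and_merges(json.loads(text))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8"
    )
    (out / "merges.txt").write_text("\n".join(merges) + "\n", encoding="utf-8")


# Tiny in-memory example instead of a real download.
sample = {"model": {"vocab": {"a": 0, "b": 1, "ab": 2}, "merges": ["a b"]}}
vocab, merges = extract_vocab_and_merges(sample)
print(vocab["ab"], merges)  # prints: 2 ['a b']
```

For a real model, `write_bpe_files("tokenizer.json", "assets")` would produce the two files from a downloaded tokenizer.json; the paths here are placeholders.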

mehdihadeli commented 1 week ago

@LittleLittleCloud Thanks for your response