Closed mehdihadeli closed 1 week ago
Hi,
Not at this time. However, you can use the vocab / config files provided in the HuggingFace model repos to create your tokenizers (if that tokenizer is supported by Microsoft.ML.Tokenizers).
For example, the e5-small model uses a BERT tokenizer, and its vocab file is at https://huggingface.co/intfloat/e5-small-v2/resolve/main/vocab.txt
var tokenizer = BertTokenizer.Create(Path.Join("assets", "vocab.txt"));
Hi,
Thanks for your response.
Unfortunately, most models, like Xenova/llama3-tokenizer, don't have a vocab file.
Could we pass tokenizer.json to the BertTokenizer.Create() method instead?
@mehdihadeli You can extract the contents of vocab.json and merges.txt from tokenizer.json.
The vocabulary is stored under the $root.model.vocab path, and it's a Dictionary<string, int>.
The merges are stored under $root.model.merges, and they're a string array.
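A minimal sketch of that extraction, using System.Text.Json, might look like the following. It assumes a BPE-style tokenizer.json in which model.vocab is a JSON object mapping tokens to ids. The class name TokenizerJsonExtractor and the output file names are hypothetical, chosen only for illustration. Some tokenizer.json files store each merge as a two-element array rather than a single "a b" string, so the sketch accepts both.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical helper, not part of Microsoft.ML.Tokenizers.
public static class TokenizerJsonExtractor
{
    // Parses the text of a tokenizer.json and returns model.vocab and model.merges.
    // Assumes model.vocab is an object of token -> id (BPE / WordPiece style).
    public static (Dictionary<string, int> Vocab, List<string> Merges) Extract(string tokenizerJson)
    {
        using JsonDocument doc = JsonDocument.Parse(tokenizerJson);
        JsonElement model = doc.RootElement.GetProperty("model");

        var vocab = new Dictionary<string, int>();
        foreach (JsonProperty entry in model.GetProperty("vocab").EnumerateObject())
        {
            vocab[entry.Name] = entry.Value.GetInt32();
        }

        var merges = new List<string>();
        // WordPiece models have no merges, so the property is optional here.
        if (model.TryGetProperty("merges", out JsonElement mergesElement))
        {
            foreach (JsonElement merge in mergesElement.EnumerateArray())
            {
                // Either "a b" (string) or ["a", "b"] (pair), depending on the file.
                string line = merge.ValueKind == JsonValueKind.Array
                    ? string.Join(" ", merge.EnumerateArray().Select(e => e.GetString()))
                    : merge.GetString() ?? string.Empty;
                merges.Add(line);
            }
        }

        return (vocab, merges);
    }

    // Writes vocab.json and merges.txt into outputDir (file names are an assumption).
    public static void WriteFiles(string tokenizerJsonPath, string outputDir)
    {
        var (vocab, merges) = Extract(File.ReadAllText(tokenizerJsonPath));
        Directory.CreateDirectory(outputDir);
        File.WriteAllText(Path.Join(outputDir, "vocab.json"), JsonSerializer.Serialize(vocab));
        File.WriteAllLines(Path.Join(outputDir, "merges.txt"), merges);
    }
}
```

Calling TokenizerJsonExtractor.WriteFiles(Path.Join("assets", "tokenizer.json"), "assets") would then produce the two files next to the original. Whether they load correctly still depends on the tokenizer type being supported by Microsoft.ML.Tokenizers.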
@LittleLittleCloud Thanks for your response.
Hi, is it possible to use pretrained HuggingFace tokenizers with the ML.NET tokenizer like this?