Open · EricLBuehler opened this issue 2 months ago
What does being UTF-8 compatible mean?
The warning indicates that some UTF-8-specific bytes are not present in the vocabulary, which can result from a few different causes.
In this case, I think it is because the tokenizer does some interesting preprocessing on the vocabulary. Hugging Face's byte-level BPE (bbpe) tokenizer encodes control bytes and non-ASCII bytes in a special way, and we need to decode them back to raw bytes before passing the vocabulary into the KBNF Vocabulary. You can check this file to see the heuristic handling I used in formatron. I should make this clearer in the docs.
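For context, here is a rough sketch (not the actual formatron code; the function names are my own) of the kind of decoding this refers to: byte-level BPE tokenizers map every raw byte to a printable Unicode character (for example, a leading space becomes "Ġ"), so recovering the real token bytes means inverting that table:

```rust
use std::collections::HashMap;

/// Invert the GPT-2-style byte-to-unicode table that byte-level BPE
/// tokenizers use to make every byte printable. Sketch only; the real
/// heuristic in formatron handles more cases.
fn unicode_to_bytes() -> HashMap<char, u8> {
    let mut n = 0u32;
    let mut table = HashMap::new();
    for b in 0u32..256 {
        let byte = b as u8;
        // These byte ranges are kept as-is by the byte-level BPE mapping;
        // everything else is shifted up past U+0100.
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code = if printable {
            b
        } else {
            let c = 256 + n;
            n += 1;
            c
        };
        table.insert(char::from_u32(code).unwrap(), byte);
    }
    table
}

/// Map a byte-level BPE token string (e.g. "Ġhello") back to its raw bytes.
fn token_to_raw_bytes(token: &str, table: &HashMap<char, u8>) -> Option<Vec<u8>> {
    token.chars().map(|c| table.get(&c).copied()).collect()
}

fn main() {
    let table = unicode_to_bytes();
    // "Ġhello" decodes to b" hello": 'Ġ' stands for the space byte 0x20.
    assert_eq!(token_to_raw_bytes("Ġhello", &table), Some(b" hello".to_vec()));
}
```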
Should we be panicking here?
No, this indeed looks like a separate bug, since vocabulary loading should not interfere with grammar creation. Specifically, suspicious vocabulary loading should not lead to any panics (hence a warning rather than a hard error). Could you share the KBNF grammar string you are using?
I think I managed to reproduce the bug; I guess the start nonterminal (which defaults to start) is not present in your KBNF grammar, and I somehow forgot to handle that case when validating the grammar. It is fixed in 0.5.2 now.
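For reference, a grammar that does declare the default start nonterminal might look roughly like this (illustrative only; the rule body and quoting style here are placeholders, so check the kbnf documentation for the exact syntax):

```
start ::= 'yes' | 'no';
```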
@EricLBuehler Could you elaborate a bit about how you would like to integrate kbnf into candle.rs and candle-vllm? I have more time now and I would like to create a PR for the integration.
Hi @Dan-wanna-M! That sounds great. I have a PR here: https://github.com/EricLBuehler/mistral.rs/pull/815
Perhaps you could take a look?
Hi @Dan-wanna-M!
I wanted to integrate your great work here into mistral.rs and Candle. However, when testing the microsoft/Phi-3.5-mini-instruct model's tokenizer with the code below, I get an error. Output:
Vocabulary::new already returns a Result, so maybe we can just return an error for this.
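To illustrate the suggestion, here is a rough sketch of how a caller could surface the problem as an error instead of panicking. The tokenizer loading uses the tokenizers crate; the exact paths and argument shapes on the kbnf side (kbnf::Token wrapping raw bytes, Vocabulary::new taking two id-indexed maps) are assumptions for illustration only, and only the fact that Vocabulary::new returns a Result comes from the thread itself.

```rust
use ahash::AHashMap;
use tokenizers::Tokenizer;

// NOTE: the argument shapes of kbnf::Vocabulary::new and the kbnf::Token
// constructor below are assumed, not taken from the kbnf docs.
fn build_kbnf_vocabulary(model_id: &str) -> anyhow::Result<kbnf::Vocabulary> {
    // Load the Hugging Face tokenizer (e.g. "microsoft/Phi-3.5-mini-instruct").
    let tokenizer = Tokenizer::from_pretrained(model_id, None)
        .map_err(|e| anyhow::anyhow!("failed to load tokenizer: {e}"))?;

    // Build id -> token maps from the tokenizer vocabulary. In practice the
    // byte-level BPE strings should be decoded to raw bytes first (see the
    // heuristic discussed earlier in this thread).
    let mut id_to_token: AHashMap<u32, kbnf::Token> = AHashMap::new();
    let mut id_to_token_string: AHashMap<u32, String> = AHashMap::new();
    for (token, id) in tokenizer.get_vocab(true) {
        id_to_token.insert(id, kbnf::Token(token.as_bytes().to_vec().into_boxed_slice()));
        id_to_token_string.insert(id, token);
    }

    // Vocabulary::new returns a Result, so the failure can simply be
    // propagated with `?` instead of panicking or only logging a warning.
    let vocab = kbnf::Vocabulary::new(id_to_token, id_to_token_string)
        .map_err(|e| anyhow::anyhow!("invalid vocabulary: {e}"))?;
    Ok(vocab)
}
```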