Dan-wanna-M / kbnf

A high-performance constrained decoding engine based on context-free grammars, written in Rust

[Bug?] Incompatible with Hugging Face `Tokenizers` #18

Open EricLBuehler opened 2 months ago

EricLBuehler commented 2 months ago

Hi @Dan-wanna-M!

I wanted to integrate your great work here into mistral.rs and Candle. However, when testing with the microsoft/Phi-3.5-mini-instruct model's tokenizer using the code below, I get an error.

pub fn new(grammar: &str, tokenizer: &Tokenizer) -> Result<Self> {
    // Collect the full vocabulary, including added/special tokens.
    let tokenizer_vocab = tokenizer.get_vocab(true);
    let mut id_to_tok = AHashMap::new();
    let mut id_to_tok_str = AHashMap::new();
    for (tok_str, id) in tokenizer_vocab {
        // Use the token string's UTF-8 bytes directly as the token bytes.
        id_to_tok.insert(id, Token(tok_str.as_bytes().to_vec().into_boxed_slice()));
        id_to_tok_str.insert(id, tok_str);
    }
    let vocab = Vocabulary::new(id_to_tok, id_to_tok_str)?;
    Ok(Self {
        engine: Engine::new(grammar, vocab)?,
        vocab_size: tokenizer.get_vocab_size(true),
    })
}

Output:

WARN kbnf::vocabulary: The following bytes are not present in any token: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 127, 192, 193, 221, 222, 238, 241, 242, 243, 244, 245, 246, 247]. This likely indicates that the vocabulary loading code is wrong, the tokenizer is doing some creepy processing or the tokenizer is not UTF-8 compatible. Check the vocabulary loading code and the tokenizer code to fix any bug and/or consider processing the vocab like the tokenizer.    
thread '<unnamed>' panicked at /home/ericbuehler/.cargo/registry/src/index.crates.io-6f17d22bba15001f/kbnf-syntax-0.4.2/src/grammar.rs:36:14:
called `Option::unwrap()` on a `None` value
Dan-wanna-M commented 2 months ago

What does being UTF-8 compatible mean?

The warning indicates that some UTF-8-specific bytes are not present in any token, which can result from a bug in the vocabulary loading code or from preprocessing done by the tokenizer itself.

In this case I think it is the latter: the tokenizer is doing some interesting preprocessing on the vocabulary. Hugging Face's byte-level BPE tokenizer encodes control bytes and non-ASCII bytes as printable Unicode characters, so we need to decode the token strings back into raw bytes before passing them into a KBNF Vocabulary. You can check this file to see the heuristic handling I used in Formatron. I should make this clearer in the docs.
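To make the decoding step concrete, here is a minimal sketch of inverting the GPT-2-style byte-to-unicode table that Hugging Face's byte-level BPE uses (the helper names `unicode_to_bytes` and `token_str_to_bytes` are hypothetical, not part of kbnf's API; this also assumes a byte-level BPE tokenizer, whereas SentencePiece-based tokenizers need different handling):

```rust
use std::collections::HashMap;

// Byte-level BPE maps each raw byte to a printable Unicode character.
// We rebuild that table and invert it so token strings can be turned
// back into the raw bytes a kbnf Vocabulary expects.
fn unicode_to_bytes() -> HashMap<char, u8> {
    let mut byte_to_char: HashMap<u8, char> = HashMap::new();
    // Bytes that byte-level BPE keeps as their own (Latin-1) character.
    let mut printable: Vec<u8> = (b'!'..=b'~').collect();
    printable.extend(0xA1u8..=0xAC);
    printable.extend(0xAEu8..=0xFF);
    for &b in &printable {
        byte_to_char.insert(b, b as char);
    }
    // Every remaining byte is shifted into the U+0100.. range, in order.
    let mut n = 0u32;
    for b in 0u8..=0xFF {
        byte_to_char.entry(b).or_insert_with(|| {
            let c = char::from_u32(0x100 + n).unwrap();
            n += 1;
            c
        });
    }
    byte_to_char.into_iter().map(|(b, c)| (c, b)).collect()
}

// Decode one token string (e.g. "Ġhello") back to raw bytes.
// Returns None if a character is not in the table.
fn token_str_to_bytes(token: &str, table: &HashMap<char, u8>) -> Option<Vec<u8>> {
    token.chars().map(|c| table.get(&c).copied()).collect()
}

fn main() {
    let table = unicode_to_bytes();
    // 'Ġ' (U+0120) is byte-level BPE's encoding of the space byte 0x20.
    assert_eq!(
        token_str_to_bytes("Ġhello", &table),
        Some(b" hello".to_vec())
    );
}
```

With something like this, the vocabulary-loading loop could insert the decoded bytes instead of `tok_str.as_bytes()` when the tokenizer is a byte-level BPE one.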

Should we be panicking here?

No, this indeed looks like a separate bug, since vocabulary loading should not interfere with grammar creation. Specifically, suspicious vocabulary loading should not lead to any panics (hence a warning rather than a hard error). Could you share the KBNF grammar string you used?

Dan-wanna-M commented 2 months ago

I think I managed to reproduce the bug: the start nonterminal (which defaults to start) is not present in your KBNF grammar, and I forgot to handle that case when validating the grammar. It is fixed in 0.5.2 now.
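For reference, a minimal grammar that does define the default start nonterminal would look something like the following (a sketch of kbnf's EBNF-style syntax; the exact rule syntax may differ, so treat it as illustrative):

```
start ::= "true" | "false";
```

With 0.5.2, a grammar that omits start should now produce a descriptive error instead of the unwrap panic above.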

Dan-wanna-M commented 1 month ago

@EricLBuehler Could you elaborate a bit on how you would like to integrate kbnf into candle.rs and candle-vllm? I have more time now and I would like to create a PR for the integration.

EricLBuehler commented 1 month ago

Hi @Dan-wanna-M! That sounds great. I have a PR here: https://github.com/EricLBuehler/mistral.rs/pull/815

Perhaps you could take a look?