Open lbeurerkellner opened 1 year ago
Hi Luca, is anyone assigned to this? If not, I would like to take it up.
I am utilizing this repo in a project of ours, I would like to contribute.
Thanks for offering help.
I likely won't get to this in the next few weeks, so any help would be greatly appreciated. For reference, the implementation should end up being very similar to https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizers/hf_tokenizer.py. Also, for constraining, you need to make sure https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizer.py#L321 get_vocab
here can obtain the full mapping of subtoken strings to input IDs.
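To illustrate the get_vocab requirement, here is a minimal sketch with a toy SentencePiece-style vocabulary (names and data are illustrative only, not the actual LMQL API): the constraint machinery needs the *complete* subtoken-string-to-ID mapping so it can dissect the vocabulary for partial matches.

```python
# Illustrative sketch with a toy vocabulary (not the real LMQL code):
# get_vocab() must return the complete mapping of subtoken strings to
# input IDs, since the constraining logic scans the whole vocabulary.

def get_vocab(id_to_piece):
    """Build the full piece -> id mapping from an id -> piece table."""
    return {piece: token_id for token_id, piece in enumerate(id_to_piece)}

# toy vocabulary in SentencePiece style ("▁" marks a leading space)
toy_pieces = ["<unk>", "<s>", "</s>", "▁Hello", "▁world", "!"]
vocab = get_vocab(toy_pieces)

assert vocab["▁Hello"] == 3
assert len(vocab) == len(toy_pieces)  # every subtoken is covered
```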
Let me know if you need any other pointers.
@lbeurerkellner Thanks for the starter information! I will go through those and reply here if I need any help.
Wanted to give an update. It took me quite some time, but I now understand the issue. The fix will come shortly.
Awesome, looking forward to it. Feel free to reach out with any questions :)
Hi Luca, I ran into an issue while developing the new tokenizer class.
The VocabularyMatcher class tries to get the encoding for " ". Llama tokenizers are based on SentencePiece, where " " is not tokenized by this method. I'm currently investigating what VocabularyMatcher does and possibly extending it to make it compatible with SentencePiece.
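For anyone following along, here is a toy sketch of the failure mode (illustrative data, not the actual LMQL code): SentencePiece does not store a standalone " " token; a leading space is folded into the following piece as the "▁" (U+2581) marker, so a naive vocabulary lookup of " " finds nothing.

```python
# Toy illustration of why encoding " " fails with SentencePiece-style
# vocabularies: spaces are represented as the "▁" prefix on pieces,
# not as a standalone token.

toy_vocab = {"<unk>": 0, "▁Hello": 1, "▁world": 2, "!": 3}

def naive_lookup(vocab, text):
    # how a plain string -> id lookup behaves on such a vocabulary
    return vocab.get(text)

assert naive_lookup(toy_vocab, " ") is None    # no standalone space token
assert naive_lookup(toy_vocab, "▁Hello") == 1  # the space lives inside "▁Hello"
```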
It would be helpful to have some context around VocabularyMatcher.
Great to hear about progress here. You can have a look at https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizers/hf_tokenizer.py#L122.
In short, VocabularyMatcher needs space and newline to dissect the vocabulary for partial matches. The implementation above is for the HF Llama tokenizer, so I expect hard-coding the values for space and newline, as done there, should do the trick.
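A minimal sketch of that hard-coding approach, mirroring the HF Llama workaround linked above (toy vocabulary; function and constant names here are hypothetical): expose the vocabulary with the SentencePiece markers translated back to plain text, so space and newline become addressable.

```python
# Hedged sketch (toy data, hypothetical names): normalize a
# SentencePiece-style vocabulary so that "▁" becomes a real space and
# the Llama byte-fallback token for "\n" becomes a real newline.

SPIECE_UNDERLINE = "▁"          # SentencePiece word-boundary marker
NEWLINE_BYTE_TOKEN = "<0x0A>"   # Llama-style byte token for "\n"

def normalized_vocab(raw_vocab):
    out = {}
    for piece, token_id in raw_vocab.items():
        if piece == NEWLINE_BYTE_TOKEN:
            piece = "\n"
        else:
            piece = piece.replace(SPIECE_UNDERLINE, " ")
        out[piece] = token_id
    return out

toy = {"▁Hello": 1, "▁": 2, "<0x0A>": 3}
vocab = normalized_vocab(toy)

assert vocab[" Hello"] == 1
assert vocab[" "] == 2   # space is now addressable
assert vocab["\n"] == 3  # newline is now addressable
```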
Your fix seems to work. Thanks! I will complete the implementation and run some local instances.
Updated the PR - https://github.com/eth-sri/lmql/pull/208. It introduces a new tokenizer class - LlamaCPPTokenizer which uses sentencepiece.
Tested it with the open_llama_3b local tokenizer model and the Hugging Face tokenizer danielhanchen/open_llama_3b; the outputs match.
For pure llama.cpp operation of LMQL, we should also support the tokenizer that ships with llama.cpp itself, avoiding the need to install 'transformers' just for tokenization.