eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0

Tokenizer: Add support for tokenisation via llama-cpp-python #162

Open lbeurerkellner opened 1 year ago

lbeurerkellner commented 1 year ago

For pure llama.cpp operation of LMQL, we should support the tokenizer that ships with llama.cpp, avoiding the need to install 'transformers' just for tokenisation.
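For context, llama-cpp-python already exposes tokenization on its `Llama` class; a minimal sketch of that surface (the model path is illustrative, and the `vocab_only` flag may depend on the installed version):

```python
# Minimal sketch of llama-cpp-python's tokenization API.
# Assumes llama-cpp-python is installed and a local model file exists
# at the (hypothetical) path below.
from llama_cpp import Llama

llm = Llama(model_path="./models/open_llama_3b.bin", vocab_only=True)

ids = llm.tokenize(b"Hello, world!")   # bytes in, token IDs out
text = llm.detokenize(ids)             # token IDs in, bytes out
print(ids, text.decode("utf-8"))
```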

khushChopra commented 1 year ago

Hi Luca, is anyone assigned to this? If not, I would like to take it up.

I am using this repo in a project of ours and would like to contribute.

lbeurerkellner commented 1 year ago

Thanks for offering help.

I likely won't get to this in the next few weeks, so any help would be greatly appreciated. For reference, the implementation should end up very similar to https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizers/hf_tokenizer.py. Also, for constraining, you need to make sure that get_vocab (https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizer.py#L321) can obtain the full mapping of subtoken strings to input IDs.

Let me know if you need any other pointers.
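A very rough sketch of the shape this could take; the class and method names here just mirror the HF wrapper and are assumptions, not the actual LMQL interface, and sentencepiece (which the PR later in this thread adopts) is used to read the Llama tokenizer model:

```python
# Rough sketch only: names mirror LMQL's hf_tokenizer.py and are assumptions.
# Requires `pip install sentencepiece` and a local tokenizer.model
# (path is illustrative).
import sentencepiece as spm

class LlamaCPPTokenizerSketch:
    def __init__(self, model_file="tokenizer.model"):
        self.sp = spm.SentencePieceProcessor(model_file=model_file)

    def encode(self, text):
        return self.sp.encode(text)

    def decode(self, ids):
        return self.sp.decode(ids)

    def get_vocab(self):
        # Full mapping of subtoken strings to input IDs, as required
        # for constraint decoding (see tokenizer.py#L321 above).
        return {self.sp.id_to_piece(i): i
                for i in range(self.sp.get_piece_size())}
```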

khushChopra commented 1 year ago

@lbeurerkellner Thanks for the starter information! I will go through those and reply here if I need any help.

khushChopra commented 1 year ago

Wanted to give an update. It took me quite some time, but I now understand the issue. A fix will follow shortly.

lbeurerkellner commented 1 year ago

Awesome, looking forward to it. Feel free to reach out with any questions :)

khushChopra commented 1 year ago

Hi Luca, I ran into an issue while developing the new tokenizer class.

The VocabularyMatcher class tries to get the encoding for " ". Llama tokenizers are based on SentencePiece, and a bare " " is not tokenized by this method. I am currently investigating what VocabularyMatcher does and may extend it to be compatible with sentencepiece.

It would be helpful to have some context around VocabularyMatcher.
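A quick way to reproduce what I am seeing (assuming a local tokenizer.model; the exact output depends on the model):

```python
# Inspect how a SentencePiece Llama tokenizer treats a bare space.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode(" ", out_type=str))  # pieces, e.g. the meta-symbol "▁" or nothing
print(sp.encode(" "))                # the corresponding token IDs
```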

lbeurerkellner commented 1 year ago

Great to hear about progress here. You can have a look at https://github.com/eth-sri/lmql/blob/main/src/lmql/runtime/tokenizers/hf_tokenizer.py#L122.

In short, VocabularyMatcher needs the space and newline tokens to dissect the vocabulary for partial matches. The implementation above is for the HF Llama tokenizer, so I expect that hard-coding the values for space and newline, as done there, should do the trick.
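A sketch of what I mean, assuming the common Llama SentencePiece conventions ("▁" as the space meta-symbol, "<0x0A>" as the newline byte-fallback piece); the exact piece names are model-dependent assumptions:

```python
# Hard-coded lookup of the space and newline pieces, assuming common
# Llama SentencePiece conventions (piece names are model-dependent).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

SPACE_ID = sp.piece_to_id("▁")       # U+2581 meta-symbol for space
NEWLINE_ID = sp.piece_to_id("<0x0A>") # byte-fallback piece for "\n"
print(SPACE_ID, NEWLINE_ID)
```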

khushChopra commented 1 year ago

Your fix seems to work. Thanks! I will complete the implementation and test it on some local instances.

khushChopra commented 1 year ago

Updated the PR: https://github.com/eth-sri/lmql/pull/208. It introduces a new tokenizer class, LlamaCPPTokenizer, which uses sentencepiece.

Tested it with the local open_llama_3b tokenizer model and the Hugging Face tokenizer danielhanchen/open_llama_3b; the outputs match.
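For reference, a minimal cross-check along those lines (paths illustrative; transformers is only needed for the comparison, and the two may still differ on BOS/special-token handling):

```python
# Sanity check: compare sentencepiece output against the HF tokenizer.
import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
hf = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")

text = "LMQL constrains decoding at the token level."
print(sp.encode(text))
print(hf.encode(text, add_special_tokens=False))  # should match, modulo BOS handling
```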