abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Support for a limited vocabulary for generation #998

Open mgorenstein opened 9 months ago

mgorenstein commented 9 months ago

Is your feature request related to a problem? Please describe.
I would like to constrain the model output to only use a custom vocabulary comprising a list of allowable words (or alternatively, to blacklist all other words in the vocabulary).

Describe the solution you'd like
HuggingFace's transformers library has a bad_words_ids keyword argument in the model.generate function that accepts a list of words (as token-id sequences) to exclude from the output (some discussion of this feature here).
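For reference, a minimal sketch of that approach; the model name and blacklist below are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each banned word becomes a sequence of token ids. Note that with GPT-2-style
# tokenizers a leading space changes the tokenization, so both forms may be needed.
bad_words = ["darn", " darn"]  # placeholder blacklist
bad_words_ids = tokenizer(bad_words, add_special_tokens=False).input_ids

inputs = tokenizer("Don't say any bad words:", return_tensors="pt")
out = model.generate(**inputs, bad_words_ids=bad_words_ids, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))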

Describe alternatives you've considered
Could this be achieved with a llama_cpp.LogitsProcessor? I am less familiar with this library and haven't found examples along these lines, so I'm unsure how straightforward it would be to implement with one.

brandonrobertz commented 9 months ago

You can do this now by streaming the response and excluding words yourself. E.g.:

stream = llm(
    "Don't say any bad words:",
    stream=True,
    echo=True
)
response = ""
for chunk in stream:
    # Each streamed chunk contains one piece of generated text.
    text = chunk["choices"][0]["text"]
    # Skip any piece that matches the blacklist; keep everything else.
    if text in BAD_WORDS_LIST:
        continue
    response += text

print("The LLM response w/o bad words:", response)
mgorenstein commented 9 months ago

Hi @brandonrobertz thanks for this suggestion!

I've modified my issue to be a bit clearer: basically, I'd like to bias or constrain beam search so that the 'bad words' never appear in any generated subsequence (or, alternatively, to allow only specific words during generation), rather than filtering them out of a completed output.

brandonrobertz commented 9 months ago

> Hi @brandonrobertz thanks for this suggestion!
>
> I've modified my issue to be a bit clearer: basically, I'd like to bias or constrain beam search so that the 'bad words' never appear in any generated subsequence (or, alternatively, to allow only specific words during generation), rather than filtering them out of a completed output.

I see. So you want an actual custom token sampler. In that case you'd need to add your own sampler in llama.cpp (or modify an existing one). Here's where the top_k sampler lives (you could modify it and build against a custom llama.cpp in vendor/llama.cpp):

https://github.com/ggerganov/llama.cpp/blob/948ff137ec37f1ec74c02905917fa0afc9b97514/llama.cpp#L7364-L7387

This library really just wraps llama.cpp and doesn't provide its own samplers and whatnot, AFAICT.

abetlen commented 9 months ago

@mgorenstein you can also do this with logit_bias or a custom LogitsProcessor; however, both operate at the token level, so it's not perfect.
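For anyone who wants to try the LogitsProcessor route, a rough sketch; the model path and allowed word list are placeholders, and the exact processor signature may vary slightly between versions of this library:

import numpy as np
from llama_cpp import Llama, LogitsProcessorList

llm = Llama(model_path="./model.gguf")  # placeholder path

# Collect the token ids of every allowed word. Words can map to several
# tokens, and tokenization depends on surrounding text, so this whitelist
# is only approximate.
ALLOWED_WORDS = [" yes", " no", " maybe"]  # placeholder vocabulary
allowed_ids = set()
for word in ALLOWED_WORDS:
    allowed_ids.update(llm.tokenize(word.encode("utf-8"), add_bos=False))
allowed_ids.add(llm.token_eos())  # let generation stop normally

def restrict_vocab(input_ids, scores):
    # Set every token outside the whitelist to -inf so it can never be sampled.
    masked = np.full_like(scores, -np.inf)
    keep = list(allowed_ids)
    masked[keep] = scores[keep]
    return masked

out = llm(
    "Answer with yes, no, or maybe:",
    max_tokens=8,
    logits_processor=LogitsProcessorList([restrict_vocab]),
)
print(out["choices"][0]["text"])

As noted above, this constrains individual tokens rather than whole words, so multi-token words are only approximately handled.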

mgorenstein commented 9 months ago

Thanks @abetlen, LogitsProcessor is the approach I ended up taking. Have a partially working solution that I set aside, will report back when it's further along.

JulesGM commented 5 months ago

bad_words_ids accepts n-grams, so you could have just tokenized your rejected word list.
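A short sketch of what that looks like (the model and phrases are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
# Each entry in bad_words_ids is a whole token-id sequence (an n-gram),
# so multi-token words and phrases can be blocked too.
rejected = ["bad phrase", "another bad phrase"]  # placeholder list
bad_words_ids = tokenizer(rejected, add_special_tokens=False).input_ids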

JulesGM commented 5 months ago

Did you look into using transformers.Constraint?
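For context, Constraint objects drive constrained beam search, which forces phrases to appear in the output rather than restricting the vocabulary; a rough sketch, with a placeholder model and phrase:

from transformers import AutoModelForCausalLM, AutoTokenizer, PhrasalConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Force the phrase to appear in the generated text; constrained generation
# requires beam search (num_beams > 1).
phrase_ids = tokenizer("custom vocabulary", add_special_tokens=False).input_ids
constraint = PhrasalConstraint(phrase_ids)

inputs = tokenizer("Describe the feature:", return_tensors="pt")
out = model.generate(**inputs, constraints=[constraint], num_beams=4, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))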