eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0

[Error during generate()] index out of range in self #272

Open akhilrazdan opened 12 months ago

akhilrazdan commented 12 months ago

I am trying to run the following and getting an error

import lmql
query_string = """
    "Hello [WHO]"
"""

result = await lmql.run(query_string, model="local:gpt2")
print(result)

This is throwing an error:

[gpt2 ready on device cpu]
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
[Error during generate()] index out of range in self

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
.../src/lmql/runtime/dclib/dclib_cache.py:600: CacheWarning: CachedDcModel: underlying model <lmql.models.lmtp.lmtp_dcmodel.LMTPDcModel object at 0x2b724a0b0> raised an exception during generation and will thus not be cached: <class 'lmql.models.lmtp.errors.LMTPStreamError'> 'failed to generate tokens 'index out of range in self''
  warnings.warn(msg, category=CacheWarning)

And the cell takes forever to return. Any pointers as to what could be happening?
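
(For what it's worth, I am running this in a notebook cell, hence the top-level await; as a plain script the equivalent would presumably need an event loop, something like this sketch using asyncio:)

import asyncio
import lmql

query_string = """
    "Hello [WHO]"
"""

async def main():
    # lmql.run is a coroutine, so it has to be awaited inside an event loop
    result = await lmql.run(query_string, model="local:gpt2")
    print(result)

asyncio.run(main())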

lbeurerkellner commented 12 months ago

Hi there. The error message suggests there may be an issue with your installation of bitsandbytes or transformers. Maybe this helps: https://github.com/oobabooga/text-generation-webui/issues/2397
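
As a quick sanity check (nothing LMQL-specific, just a sketch verifying that the underlying packages import cleanly and report their versions):

# check the packages behind the warning
import torch
import transformers
import bitsandbytes  # the 'cadam32bit_grad_fp32' warning typically originates here

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("bitsandbytes", bitsandbytes.__version__)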

gijswijnholds commented 9 months ago

Hi, here's some more info suggesting the problem is the transformers model's maximum length being exceeded (similar to related but non-LMQL posts on this error: https://stackoverflow.com/questions/62081155/pytorch-indexerror-index-out-of-range-in-self-how-to-solve and https://discuss.huggingface.co/t/adding-new-tokens-indexerror-index-out-of-range-in-self/6731)
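
(For reference, the model's configured context size can be read from transformers directly; a small sketch:)

from transformers import AutoConfig

# gpt2's context window - n_positions is 1024, matching the "maximum length (1024)" reminder
config = AutoConfig.from_pretrained("gpt2")
print(config.n_positions)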

Here's what I did: first of all, I made a clean conda install with Python 3.10.13 and ran 'pip install lmql[hf]'. Then I started a server locally with

lmql serve-model

Then I try two things:

  1. I try to run a decorated query
import lmql

@lmql.query
def simple_question(question):
    '''lmql
    "Q: {question}\n"
    "The answer is: [ANSWER]."
    return ANSWER
    '''

answer = simple_question("What is the meaning of life?", model="gpt2", temperature=0.5)

This generates the error, but the server outputs "This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all."

I can retry and set the maximum length:

answer = simple_question("What is the meaning of life?", model="gpt2", temperature=0.5, max_len=50)

but this will raise an (expected) AssertionError: The decoder returned a sequence that exceeds the provided max_len (max_len=50, sequence length=50). To increase the max_len, please provide a corresponding max_len argument to the decoder function.

So this route ends here for me, as the model will ultimately generate past its own maximum length and then raise the IndexError. Perhaps there is a way I could solve it myself, but it could also hint at a potential issue with either Transformers or LMQL.
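
(What I would actually like is to put the token budget into the query itself; a sketch of what I mean, going by the inline where clauses and the TOKENS constraint described in the LMQL docs - I have not verified that this avoids the IndexError:)

import lmql

@lmql.query
def simple_question(question):
    '''lmql
    "Q: {question}\n"
    # bound the number of tokens ANSWER may take via an inline where clause
    "The answer is: [ANSWER]." where len(TOKENS(ANSWER)) < 20
    return ANSWER
    '''

answer = simple_question("What is the meaning of life?", model="gpt2", temperature=0.5)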

So the other route:

  2. I use a length-constrained prompt with the Generations API:
import lmql

prompt = """
"Greet the user in four different ways: [GREETINGS]" \
   where len(TOKENS(GREETINGS)) < 10
"""

m: lmql.LLM = lmql.model("gpt2")
m.generate_sync(prompt)

I get the same error: despite the token-length constraint in the prompt, the model still keeps generating. However, this time I can change the call to

m.generate_sync(prompt, max_tokens=10)

which works perfectly fine! However, the 'max_tokens' parameter is not available in the decorated query as far as I know, and of course we want to specify such a constraint in the query itself!
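
(My guess is that generate_sync treats the string as a plain completion prompt, so the where clause is just literal text; if I read the API right, running the same program as a full LMQL query should actually enforce the constraint - a sketch, assuming lmql.run_sync accepts a query string the same way lmql.run does:)

import lmql

query_string = """
    "Greet the user in four different ways: [GREETINGS]" where len(TOKENS(GREETINGS)) < 10
"""

# the constraint is now part of a parsed LMQL program, not plain prompt text
result = lmql.run_sync(query_string, model="gpt2", temperature=0.5)
print(result)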

By the way, I also tried with "gpt2-medium", which behaves the same. I did not try other models.

Hope this helps,