abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

llama_get_logits_ith: invalid logits id -1, reason: no logits #1812

Open ba0gu0 opened 4 weeks ago

ba0gu0 commented 4 weeks ago

llama_get_logits_ith: invalid logits id -1 error when embedding=True

Expected Behavior

When using llama-cpp-python with a Qwen2 model, chat completion should work normally regardless of whether the embedding parameter is enabled.

Current Behavior

The model works fine when embedding=False, but chat completion fails with llama_get_logits_ith: invalid logits id -1, reason: no logits when embedding=True.

Working Code Example

from llama_cpp import Llama

# This works fine
llm = Llama(
    model_path="./models/qwen2-0_5b-instruct-q8_0.gguf", 
    chat_format="chatml", 
    verbose=False
)

messages = [
    {"role": "system", "content": "Summarize this text for me: You are an assistant who creates short stories."},
    {"role": "user", "content": "Long ago, in a peaceful village, a little girl named Leah loved watching the stars at night..."}
]

response = llm.create_chat_completion(messages=messages)

'''
{'id': 'chatcmpl-17ca45ef-d13b-425a-96be-7631e3b9a7f4',
 'object': 'chat.completion',
 'created': 1730125699,
 'model': './models/qwen2-0_5b-instruct-q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'This text is a short story about a little girl named Leah who loves watching the stars at night. One day, she noticed a particularly bright star that seemed to wink at her, and she made a wish to become friends with the star. This star spirit helped Leah take her on a magical adventure among the stars, and she visited countless constellations and stardust rivers.'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 145, 'completion_tokens': 76, 'total_tokens': 221}
}
'''

# Works successfully

Error Reproduction

from llama_cpp import Llama

# This causes an error
llm = Llama(
    model_path="./models/qwen2-0_5b-instruct-q8_0.gguf", 
    chat_format="chatml", 
    verbose=False, 
    embedding=True  # Only difference is enabling embedding
)

messages = [
    {"role": "system", "content": "Summarize this text for me: You are an assistant who creates short stories."},
    {"role": "user", "content": "Long ago, in a peaceful village, a little girl named Leah loved watching the stars at night..."}
]

llm.create_chat_completion(messages=messages)
# Error: llama_get_logits_ith: invalid logits id -1, reason: no logits

embeddings = llm.create_embedding("Hello, world!")
# Here is normal

'''
{'object': 'list',
 'data': [{'object': 'embedding',
   'embedding': [[0.9160200953483582,
     5.090432167053223,
     1.487088680267334, ......
'''


Steps to Reproduce

  1. Install llama-cpp-python
  2. Download the Qwen2-0.5B-Instruct GGUF model (qwen2-0_5b-instruct-q8_0.gguf)
  3. Run the error reproduction code above with embedding=True

Additional Context

The error only occurs when:

  1. The embedding parameter is set to True
  2. The chat completion functionality is used

The model works fine for chat completion when embedding=False, suggesting this might be related to how the embedding functionality is implemented for this specific model.
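Until the cause is fixed, one possible workaround (a sketch based on the behavior above, not a confirmed fix) is to keep chat completion and embeddings in two separate Llama instances, so the chat context is never created with embedding=True:

from llama_cpp import Llama

MODEL_PATH = "./models/qwen2-0_5b-instruct-q8_0.gguf"

# Chat instance: embedding left at its default (False), so the decode
# path still produces logits for sampling.
chat_llm = Llama(model_path=MODEL_PATH, chat_format="chatml", verbose=False)

# Embedding instance: embedding=True, used only for create_embedding calls.
embed_llm = Llama(model_path=MODEL_PATH, embedding=True, verbose=False)

messages = [
    {"role": "system", "content": "Summarize this text for me: You are an assistant who creates short stories."},
    {"role": "user", "content": "Long ago, in a peaceful village, a little girl named Leah loved watching the stars at night..."}
]

response = chat_llm.create_chat_completion(messages=messages)  # no logits error
embeddings = embed_llm.create_embedding("Hello, world!")       # embeddings still work

The trade-off is that the model weights are loaded twice, which roughly doubles memory use for this model.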

jayendren commented 3 weeks ago

Confirming the same issue, llama_get_logits_ith: invalid logits id -1, reason: no logits, when using https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF. Setting embedding=False works (my default configuration uses True).

Environment Info

Python version: 3.9.16
llama-cpp-python version: 0.3.1
Model: Hermes-3-Llama-3.1-8B (GGUF format)
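For setups like this where embedding=True is the default configuration, a minimal sketch of the same idea (make_llm is an illustrative helper, not part of llama-cpp-python):

from llama_cpp import Llama

def make_llm(model_path: str, for_embeddings: bool = False) -> Llama:
    # Illustrative helper: enable embedding mode only for instances that
    # will actually serve create_embedding calls, keeping chat instances
    # on the default embedding=False to avoid the "no logits" error.
    return Llama(
        model_path=model_path,
        embedding=for_embeddings,
        verbose=False,
    )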