abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Speculative decoding gives weird results in v. 0.3 #1770

Closed. mobeetle closed this issue 1 month ago.

mobeetle commented 1 month ago

I ran the following example (from the instructor library: https://python.useinstructor.com/hub/llama-cpp-python/#llama-cpp-python):

```python
import llama_cpp
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

import instructor

from pydantic import BaseModel
from typing import List
from rich.console import Console

llama = llama_cpp.Llama(
    model_path="../../models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
    n_gpu_layers=-1,
    chat_format="chatml",
    n_ctx=2048,
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),  # draft model used for speculative decoding
    logits_all=True,
    verbose=False,
)

create = instructor.patch(
    create=llama.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,  # constrain the output to a JSON schema
)

text_block = """
In our recent online meeting, participants from various backgrounds joined to discuss the upcoming tech conference. The names and contact details of the participants were as follows:

During the meeting, we agreed on several key points. The conference will be held on March 15th, 2024, at the Grand Tech Arena located at 4521 Innovation Drive. Dr. Emily Johnson, a renowned AI researcher, will be our keynote speaker.

The budget for the event is set at $50,000, covering venue costs, speaker fees, and promotional activities. Each participant is expected to contribute an article to the conference blog by February 20th.

A follow-up meeting is scheduled for January 25th at 3 PM GMT to finalize the agenda and confirm the list of speakers.
"""


class User(BaseModel):
    name: str
    email: str
    twitter: str


class MeetingInfo(BaseModel):
    users: List[User]
    date: str
    location: str
    budget: int
    deadline: str


extraction_stream = create(
    response_model=instructor.Partial[MeetingInfo],  # stream partially filled models
    messages=[
        {
            "role": "user",
            "content": f"Get the information about the meeting and the users {text_block}",
        },
    ],
    stream=True,
)

console = Console()

for extraction in extraction_stream:
    obj = extraction.model_dump()
    console.clear()  # redraw the console on each streamed update
    console.print(obj)
```

The example gives strange results: the extracted data does not correspond to `text_block` and makes no sense.

When I disable speculative decoding in v. 0.3 (by commenting out the draft model), or switch to v. 0.2.90 with speculative decoding enabled, it works as expected: all required information is extracted correctly.
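For anyone who wants to repeat this comparison, the draft model can be put behind a switch instead of being commented out by hand. This is only a minimal sketch, not code from the report: `build_llama_kwargs` and its `use_speculative` flag are hypothetical names, and the keyword arguments simply mirror the ones in the example above.

```python
from typing import Any, Dict


def build_llama_kwargs(model_path: str, use_speculative: bool) -> Dict[str, Any]:
    """Build keyword arguments for llama_cpp.Llama (hypothetical helper).

    The base settings are copied from the example in this issue; only the
    draft model is toggled, which matches "commenting out the draft model".
    """
    kwargs: Dict[str, Any] = {
        "model_path": model_path,
        "n_gpu_layers": -1,
        "chat_format": "chatml",
        "n_ctx": 2048,
        "logits_all": True,
        "verbose": False,
    }
    if use_speculative:
        # Imported lazily so the non-speculative path never loads this module.
        from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

        kwargs["draft_model"] = LlamaPromptLookupDecoding(num_pred_tokens=2)
    return kwargs
```

With that in place, `llama_cpp.Llama(**build_llama_kwargs(path, use_speculative=False))` should correspond to the run without a draft model, and passing `use_speculative=True` to the run that showed the problem.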

abetlen commented 1 month ago

@mobeetle should be fixed now in v0.3.1
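A quick way to tell whether an environment still has the affected release is to check the installed package version. The sketch below relies only on what is reported in this thread (0.2.90 works, 0.3 misbehaves, 0.3.1 is fixed) and assumes that "v. 0.3" means release 0.3.0; `parse_version`, `is_affected`, and `installed_version` are hypothetical helper names, not part of llama-cpp-python.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.3.1' into (0, 3, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_affected(text: str) -> bool:
    """True only for 0.3.0 (assumption: the 'v. 0.3' named in this thread)."""
    padded = (parse_version(text) + (0, 0, 0))[:3]  # '0.3' becomes (0, 3, 0)
    return padded == (0, 3, 0)


def installed_version() -> Optional[str]:
    """Return the installed llama-cpp-python version, or None if absent."""
    try:
        return version("llama-cpp-python")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("llama-cpp-python is not installed")
    elif is_affected(current):
        print(f"{current}: affected release, upgrade to 0.3.1 or later")
    else:
        print(f"{current}: not the release reported in this issue")
```

The check is deliberately narrow: it flags only the one release named here and makes no claim about other versions.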

mobeetle commented 1 month ago

Confirmed, working great in v0.3.1.