acsresearch / interlab


Look into LangChain queries and general integration #1

Closed: gavento closed this issue 1 year ago

gavento commented 1 year ago

Rationale:

  1. Our focus has shifted from "primarily structured queries" to "primarily agent interactions"
  2. LangChain seems to have improved a lot in the last ~3 months (and is also more widely used, has a ton of integrations etc.) (Tomáš: I just updated on this a few days ago)
  3. QueryChains does not offer great usability for parsing responses as data (XML-like tags have a rather low ceiling, and e.g. we still do not have a standard formatting prompt text), and it would be a lot of work to add or improve this now.

We may want to just use LangChain for what it seems good at:

And use & focus querychains (to be renamed) on:

gavento commented 1 year ago

Notes: Parsing LLM output still seems not entirely solved; it is unclear what most people use, and the landscape seems wild (some even use instructions like "on the last line, write X"), but JSON (plus a schema and examples) seems to be one of the best approaches.
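
As an illustration, a minimal "JSON + schema + examples" prompt might look like the following (hypothetical wording, not a prompt from the experiments below):

# A hypothetical minimal prompt combining JSON, an informal schema, and an example
PROMPT = """Answer with a single JSON object matching this schema:
{"setup": "<question setting up a joke>", "punchline": "<answer resolving it>"}

For example:
{"setup": "Why did the chicken cross the road?", "punchline": "To get to the other side."}

Now tell me a joke about programmers. Answer only with the JSON object."""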

Other examples of structured output include Guidance from MS, although it performs much better with models where you (a) see the logits and (b) can control the generation flow (inserting fixed tokens etc.). Likely not useful for us (yet) but sort of on the radar; a rough sketch follows.
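
For a rough idea, a Guidance template interleaves fixed tokens with constrained generation. This is a sketch from memory of the mid-2023 API (an assumption; details may differ, and the API has since changed):

import guidance

# Sketch of the mid-2023 Guidance API (assumed, may differ). The JSON skeleton
# is fixed text inserted by the template, so the model only generates the
# quoted values and the output is valid JSON by construction.
guidance.llm = guidance.llms.OpenAI("text-davinci-003")
program = guidance('''{
    "setup": "{{gen 'setup' stop='"'}}",
    "punchline": "{{gen 'punchline' stop='"'}}"
}''')
out = program()
print(out["setup"], out["punchline"])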

gavento commented 1 year ago

tl;dr: Works great with GPT-4 (every time); davinci-003, GPT-3.5, Claude, and Falcon-40B-instruct also work, though they may require retries for complex JSON schemas. 13B-parameter models (e.g. curie) mostly fail, but this may depend on fine-tuning for JSON (so it may vary across open-source models). Running larger OSS models (falcon-40b) is still technically tricky (HF endpoints failed or were slow).

Setup

I did some experiments with Pydantic JSON prompting and parsing, using this helper function:

from langchain.chat_models.base import BaseChatModel
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

def query_json(llm, type, prompt, **vars):
    """Query `llm` with `prompt` and parse the reply into a Pydantic object of class `type`."""
    if isinstance(prompt, str):
        prompt = PromptTemplate.from_template(prompt)
    assert "format_instructions" in prompt.input_variables
    parser = PydanticOutputParser(pydantic_object=type)
    input = prompt.format_prompt(format_instructions=parser.get_format_instructions(), **vars)
    # Chat models take a list of messages; plain completion LLMs take a string
    if isinstance(llm, BaseChatModel):
        output = llm(input.to_messages()).content
    else:
        output = llm(input.to_string())
    print(output)  # log the raw model output before parsing
    return parser.parse(output)

and the Pydantic classes Joke and Nobelists (with their helpers Gender and Nobelist):

import datetime
import enum
from typing import List

from pydantic import BaseModel, Field

class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

class Gender(enum.Enum):
    m = 'M'
    f = 'F'
    other = 'other'

class Nobelist(BaseModel):
    year: int = Field(description="year of the award")
    # NB: a plain datetime.datetime field fails to parse dates like 2020-10-05
    date: datetime.date = Field(description="date of the award as YYYY-MM-DD")  # more robust
    first_name: str = Field(description="first name of the nobelist")
    last_name: str = Field(description="last name of the nobelist")
    fields: List[str] = Field(description="major research fields the nobelist worked in")
    gender: Gender

class Nobelists(BaseModel):
    awards: List[Nobelist] = Field(description="list of awarded Nobel prizes")
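
For reference, putting these together might look like the following. This is a hypothetical invocation; the model name and prompt wording are illustrative assumptions, not taken from the experiments:

from langchain.chat_models import ChatOpenAI

# Hypothetical usage of query_json with the Nobelists schema above
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
result = query_json(
    llm,
    Nobelists,
    "List the Nobel laureates in physics in {year}.\n\n{format_instructions}",
    year="2020",
)
for award in result.awards:
    print(award.year, award.first_name, award.last_name, award.fields)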

Results

Results could likely be improved with repeated queries (whether a simple repeat-on-failure, auto-fixing, or a retry parser); see the sketch below.
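
One option in LangChain at the time was OutputFixingParser, which wraps an existing parser and, on a parse failure, asks the model to repair the malformed output. A sketch reusing the parser and llm objects from above (not something the experiments used):

from langchain.output_parsers import OutputFixingParser

# Sketch: on a parse failure, the wrapper sends the bad output together with
# the parse error back to the LLM and asks it for corrected JSON
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)
result = fixing_parser.parse(output)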

(Model size estimates from here)

gavento commented 1 year ago

Resolved by #7 (cb48ab63cb60ae6be485835dcc25aab2bb0022ea)