eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0
3.71k stars, 202 forks

Best way for structured output and few-shot optimization #262

Open motaatmo opened 1 year ago

motaatmo commented 1 year ago

Hi,

I'm trying to extract information from documents into a machine-readable format. A key requirement is that the result has to contain some lists whose lengths may vary. I also want to use few-shot optimization to help enforce the correct output. My current attempt, which uses dataclasses, fails both when calling run_sync (which is how I plan to use it) and in the playground, in each case with text-davinci-003. I tried a toy example, shown below, which does not work. Is there something I'm doing wrong?

from dataclasses import dataclass
from typing import List
@dataclass
class ball_set:
    ball_colors: List[str]
    number_of_balls: int

argmax
    """
    The characteristics of a set of balls should be extracted as a python dataclass. The number and colors of balls should be returned.
    Example 1:
    There are two red balls and a white one in the set
    Result 1: ball_set(ball_colors=[["red","white"]],
                       number_of_balls: 3)
    Example 2:
    There are eight white balls in the set
    Result 2: ball_set(ball_colors:[["white"]],
                number_of_balls: 8)

    The following set should be described:
    There is one brown, one yellow, and a white ball in the set

    Result: [NEW_SET]
    """
where
    type(NEW_SET) is ball_set
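
A note on the few-shot lines in the toy prompt above: they mix `field=value` and `field: value`, so neither "Result" line is valid constructor syntax for `ball_set`. Whether this contributes to the failure is not established here. The sketch below only shows one way to keep the example lines consistent, by generating them from real dataclass instances with `repr()`. The helper names `lmql_escape` and `few_shot_line` are hypothetical, and the bracket doubling simply follows the `[[ ... ]]` convention already used in the prompt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ball_set:
    ball_colors: List[str]
    number_of_balls: int


def lmql_escape(text: str) -> str:
    """Double square brackets so they read as literal text, not as a
    template variable (same convention as the prompt above)."""
    return text.replace("[", "[[").replace("]", "]]")


def few_shot_line(index: int, example: ball_set) -> str:
    # repr() of a dataclass is valid constructor syntax, so every example
    # line uses the same `field=value` form.
    return f"Result {index}: {lmql_escape(repr(example))}"


line_1 = few_shot_line(1, ball_set(["red", "white"], 3))
line_2 = few_shot_line(2, ball_set(["white"], 8))
print(line_1)
print(line_2)
```

The generated lines can then be pasted or interpolated into the prompt in place of the hand-written "Result 1" and "Result 2" lines.
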
motaatmo commented 1 year ago

As I don't really need a dataclass as the return value (as long as the results are parsed correctly), I tried a different way of writing the LMQL query, which also fails in interesting ways. My full LMQL query is both too long and too full of things I don't want to disclose publicly, but the toy example works (or fails) in a similar way. This one works:

argmax
    """
    The characteristics of a set of balls should be extracted as a python dataclass. The number and colors of balls should be returned.
    Example 1:
    There are two red balls and a white one in the set
    Result 1: ball_colors: [["red","white"]]
              number_of_balls: 3
    Example 2:
    There are eight white balls in the set
    Result 2: ball_colors: [["white"]]
              number_of_balls: 8

    The following set should be described:
    There is one brown, one yellow, and a white ball in the set

    Result:   ball_colors: [BALL_COLORS]
              number_of_balls: [NUMBER_OF_BALLS] 
    """
where
    INT(NUMBER_OF_BALLS) and STOPS_AT(BALL_COLORS, "]")

This one, where the order of the two variables is reversed, fails:

argmax
    """
    The characteristics of a set of balls should be extracted as a python dataclass. The number and colors of balls should be returned.
    Example 1:
    There are two red balls and a white one in the set
    Result 1: number_of_balls: 3
              ball_colors: [["red","white"]]

    Example 2:
    There are eight white balls in the set
    Result 2: number_of_balls: 8
              ball_colors: [["white"]]

    The following set should be described:
    There is one brown, one yellow, and a white ball in the set

    Result:   number_of_balls: [NUMBER_OF_BALLS]
              ball_colors: [BALL_COLORS]
    """
where
    INT(NUMBER_OF_BALLS) and STOPS_AT(BALL_COLORS, "]")
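
For the two-variable form of the query, the captured values still have to be turned into a structured result. The following is a minimal sketch of that parsing step. It assumes that `BALL_COLORS` and `NUMBER_OF_BALLS` arrive as strings, and that the `BALL_COLORS` text may or may not end with the closing `]` from the stop phrase. How the values are retrieved from the query result is left out. `parse_ball_set` is a hypothetical helper, not part of LMQL.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ball_set:
    ball_colors: List[str]
    number_of_balls: int


def parse_ball_set(ball_colors_raw: str, number_of_balls_raw: str) -> ball_set:
    # The prompt already supplies the opening "[", so the captured text is
    # expected to look like '"brown","yellow","white"]'.
    inner = ball_colors_raw.strip()
    if inner.endswith("]"):
        inner = inner[:-1]  # drop the stop phrase if it was captured

    # Re-wrap in brackets and parse as a JSON list of strings.
    colors = json.loads("[" + inner + "]")
    if not all(isinstance(color, str) for color in colors):
        raise ValueError(f"expected a list of strings, got {colors!r}")

    return ball_set(ball_colors=colors,
                    number_of_balls=int(number_of_balls_raw.strip()))


parsed = parse_ball_set('"brown","yellow","white"]', " 3")
print(parsed)
```

Because the list is parsed as JSON, this relies on the model reproducing the double-quoted style of the few-shot examples. Output in any other style raises an exception instead of being silently accepted.
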