dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Probabilities for choices #1230

Open cplonski20 opened 5 months ago

cplonski20 commented 5 months ago

I have seen multiple issues on this, and one forked repo, but can someone please point me to the code I need to run Outlines structured inference (choices) with probabilities on the outputs?

Thanks so much

rlouf commented 5 months ago

Could you give us the address to the forked repo?

cplonski20 commented 5 months ago

https://github.com/craft-ai/fork-outlines/tree/probabilities

rlouf commented 1 month ago

@cpfiffer could this be (partially) solved by #1408?

As discussed somewhere else, the true probabilities are obtained by summing over all the token sequences that correspond to each of the strings. We can always provide the probabilities of each chosen path, but with a caveat.
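To make the summation concrete, here is a minimal sketch. It assumes you already have a log-probability for each token path; the `choice_probabilities` helper and the numbers are hypothetical and are not part of the Outlines API.

```python
import math
from collections import defaultdict


def choice_probabilities(path_logprobs):
    """Sum the probability of every token path that decodes to the same string."""
    totals = defaultdict(float)
    for tokens, logprob in path_logprobs.items():
        # Different tokenizations ("cat" vs "c" + "at") decode to the same string.
        totals["".join(tokens)] += math.exp(logprob)
    # Renormalize in case only a subset of the paths was enumerated.
    norm = sum(totals.values())
    return {text: p / norm for text, p in totals.items()}


# Hypothetical log-probabilities for illustration only, not real model output.
paths = {
    ("cat",): math.log(0.30),
    ("c", "at"): math.log(0.10),
    ("dog",): math.log(0.45),
    ("do", "g"): math.log(0.15),
}
probs = choice_probabilities(paths)
```

The probability of a single chosen path (say, only `("cat",)`) would understate the probability of the string "cat", which is the caveat mentioned above.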

cpfiffer commented 1 month ago

Yeah, I believe this would address it partially. As you note, it's quite difficult to get probabilities for the whole sequence, since you would need to know, within the state machine, when a choice is fully determined.

As an example, for choices ['cat', 'dog', 'catfish'] you have to enumerate the possible sequences until it's obvious what will be selected. For example, you need to sample all tokens that could begin cat or dog.

At the point a token beginning with d is sampled, you know that any sequence following is dog. cat and catfish are ambiguous until sampling either the end of the response (cat) or another token beginning with f (catfish).

For trivial choice sets this isn't that bad, but big and potentially ambiguous schemas could be difficult to calculate.
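The cat/dog/catfish example above can be sketched at the string level as follows. This is only an illustration: `remaining_choices` is a hypothetical helper, and a real implementation would work on tokens inside the state machine rather than on plain string prefixes.

```python
def remaining_choices(prefix, choices, finished=False):
    """Return the choices still compatible with the text generated so far.

    `finished` means the end of the response was sampled, so only an
    exact match is still possible.
    """
    if finished:
        return [c for c in choices if c == prefix]
    return [c for c in choices if c.startswith(prefix)]


choices = ["cat", "dog", "catfish"]

after_d = remaining_choices("d", choices)                        # only "dog" is left
after_cat = remaining_choices("cat", choices)                    # still ambiguous
after_cat_end = remaining_choices("cat", choices, finished=True) # resolved to "cat"
after_catf = remaining_choices("catf", choices)                  # resolved to "catfish"
```

A choice is fully determined once the returned list has exactly one element; with large or overlapping choice sets, many prefixes have to be tracked before that happens.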

If you did actually implement this, you would get various other superpowers elsewhere in the code, such as knowing the probability that a field was generated inside JSON, e.g.:

from typing import Literal
from pydantic import BaseModel, Field

class Choice(BaseModel):
    thing: Literal['a', 'b', 'c']
    number: int = Field(ge=0, le=10)

You could potentially get the probabilities for the fields thing and number to help understand model uncertainty at the field level.
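As a rough sketch of what that might look like, assume you could collect the log-probabilities of the tokens that produced each field's value. The `field_probability` helper and the numbers below are hypothetical; nothing here is an existing Outlines feature.

```python
import math


def field_probability(token_logprobs):
    """Probability of one sampled path for a field: the product of its token probabilities."""
    return math.exp(sum(token_logprobs))


# Hypothetical per-field token log-probabilities, for illustration only.
field_logprobs = {
    "thing": [math.log(0.8)],                  # e.g. a single token for 'a'
    "number": [math.log(0.5), math.log(0.9)],  # e.g. two tokens for '10'
}
field_probs = {name: field_probability(lps) for name, lps in field_logprobs.items()}
```

Note that this gives the probability of the sampled path only. Getting the true probability of each field value would still require summing over all token sequences that produce it, as discussed earlier in the thread.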