BoundaryML / baml

BAML is a language that helps you get structured data from LLMs, with the best DX possible. Works with all languages. Check out the promptfiddle.com playground
https://docs.boundaryml.com
Apache License 2.0

Update SAP Parser to infer a better end of string when the LLM drops multiple characters #844

Open · tomsquest opened this issue 3 months ago

tomsquest commented 3 months ago

Hi,

I am starting to play with Baml, and I am very excited about it.

Problem

I stumbled upon a strange case where a test in PromptFiddle is failing even though the LLM output seems correct. The test fails with a missing-field error, yet the LLM output appears to contain that field.

Detail

Here is the raw LLM response as printed in PromptFiddle:

{
  "title": "L'année où j'ai vécu selon la Bible: Ou l'humble quête d'un homme qui cher"rating": 4,
  "volume": null,
  "description": null,
  "comment": null,
  "dateRead": null,
  "url": [
    "https://www.amazon.fr/Lann%C3%A9e-jai-v%C3%A9cu-selon-Bible/dp/2742789928/ref=sr_1_3?keywords=jacobs+bible"
  ],
  "slug": "lannee-ou-jai-vecu-selon-la-bible-ou-lhumbe-quete-dun-homme-qui-chercha-a-suivre-la-bible-aussi-litteralement-que-possible",
  "fiction": false
}

And the test fails with: Failed to coerce value: Error parsing '<root>': Missing required field: slug

Screenshot

(screenshot of the failing test in PromptFiddle omitted)

Link to PromptFiddle

https://www.promptfiddle.com/extract-book-info-ystjv

tomsquest commented 3 months ago

Looking at the JSON as highlighted by GitHub, it seems the JSON is invalid (unterminated field: qui cher"rating": 4,). Could that be the problem?

hellovai commented 3 months ago

Yep, you nailed it! This is a really interesting edge case! It seems like the LLM dropped the closing quotation mark AND a , after

qui cher"rating

We have heuristics for inferring mistakes in JSON, but I think we can use a somewhat smarter algorithm to address this kind of issue.

(e.g. do a look ahead for all the known keys and see if we can find something else)

That said, we may also consider adding LLM-based error correction for really weird edge cases like this one.

What happened here is that we inferred the end of the first string incorrectly. We've noticed that this happens more often with non-English text, especially with structured data (as there's just less training data for structured non-English output).

Will add this to our unit test suite and see what we can do about it! (Seems very tractable, though.)

tomsquest commented 3 months ago

Excellent, thanks for the feedback!

hellovai commented 2 months ago

@tomsquest this took a bit to get to, but it looks like this is more complicated than it appears at first glance. We'll likely have to update the parser to simultaneously explore multiple candidate keys, since the content was simply lost due to the bad string termination. Any simple heuristic we apply here will likely lead to other cases being wrong.

The only approach that would likely work is:

  1. In the string, find all possible starts of known keys ("title", "rating", etc.) and then see whether there are value start/end positions that satisfy those constraints. I think something like that could work, but it needs a bit more thought on exactly how it would impact existing situations. As far as I can tell, it would only be an improvement. (A rough sketch of the idea is below.)
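
As a very rough sketch of that key-lookahead idea (hypothetical function names, simplified to flat string keys, and not the actual BAML parser code):

```rust
/// Hypothetical sketch of the key-lookahead idea; not the real parser.
/// Scan raw model output for every occurrence of a known key pattern
/// (`"key":`) and return their byte offsets. A repair pass could then use
/// these anchors to close a dangling string value just before the next key
/// and re-insert the missing comma.
fn find_key_anchors<'a>(raw: &str, known_keys: &[&'a str]) -> Vec<(usize, &'a str)> {
    let mut anchors = Vec::new();
    for &key in known_keys {
        let needle = format!("\"{key}\":");
        let mut from = 0;
        while let Some(pos) = raw[from..].find(needle.as_str()) {
            anchors.push((from + pos, key));
            from += pos + needle.len();
        }
    }
    anchors.sort_by_key(|&(pos, _)| pos);
    anchors
}

fn main() {
    // The kind of output reported above: the closing quote and comma after
    // "...qui cher" were dropped by the model.
    let raw = r#"{"title": "L'année où j'ai vécu selon la Bible: ... qui cher"rating": 4, "volume": null}"#;
    let keys = ["title", "rating", "volume", "slug"];
    for (pos, key) in find_key_anchors(raw, &keys) {
        println!("found key {key:?} at byte offset {pos}");
    }
    // A repair step could then terminate the "title" string right before the
    // "rating" anchor and insert the missing comma.
}
```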

That said, I think our bandwidth is a bit too stretched to take on this work at the moment. How often is this happening in your use case?

tomsquest commented 2 months ago

Hi @hellovai ,

Thanks for your answer.

Tricky "json" parsing indeed!

I don't know if one could borrow some ideas from a project like json_repair (not Rust, but for the idea).

> That said, I think our bandwidth is a bit too stretched to take on this work at the moment

No urgency on my side, problem solved by switching to Claude Haiku.

> How often is this happening in your use case?

It happened once using Gemini (I think it was the pro-001 model). Of course, changing the prompt rerolls the dice.

Oh, FYI, I wrote an article about how I use BAML to solve my problem (converting my list of books from text to JSON, thanks to BAML). Here it is: https://www.tomsquest.com/blog/2024/08/get-structured-output-from-llm-using-baml/

sxlijin commented 2 months ago

> Of course, changing the prompt rerolls the dice.

FYI: you can set temperature=0 on pretty much all models to maximize their determinism (most of them default to temperature=1, which means there's randomness baked into the response).

We don't currently provide a standardized way to do this, but setting options.temperature to 0 (example) will work for most models, since we forward temperature to the model provider.
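
As a sketch, an OpenAI-based client pinned to temperature 0 could look roughly like this in clients.baml (the client name and model here are just placeholders):

```baml
client<llm> DeterministicClient {
  provider openai
  options {
    model "gpt-4o-mini"
    api_key env.OPENAI_API_KEY
    temperature 0
  }
}
```

Note that even at temperature 0 most providers don't guarantee fully deterministic output, but it removes most of the sampling randomness.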

tomsquest commented 2 months ago

Hi @sxlijin ,

Thanks for the tip, that's what I did for my "final" version (clients.baml). I don't remember whether I encountered the malformed JSON before setting the temperature to 0 or not.