BoundaryML / baml

BAML is a language that helps you get structured data from LLMs, with the best DX possible. Works with all languages. Check out the promptfiddle.com playground
https://docs.boundaryml.com
Apache License 2.0

Parser Bug: Enums with shared substrings #1085

Closed hellovai closed 1 month ago

hellovai commented 1 month ago

When the LLM returns something more "approximate", our parsing algorithm could do a better job with aliases that are substrings of one another.

For example:

enum Foo {
   A @alias("car")
   B @alias("car-2")
}

Raw LLM response.

The answer is not car or car-2!

We will currently parse this as Foo.A, because technically we find two instances of "car" (one of them inside "car-2") and only one of "car-2".
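To make the miscount concrete, here is a minimal sketch of naive substring counting. It is not the code in match_string.rs; `naive_counts` and the `(variant, alias)` pair layout are hypothetical names used only for illustration.

```rust
/// Count how often each alias occurs in the response, treating every
/// substring hit as a vote for that variant (the behaviour described above).
/// `aliases` is a list of (variant name, alias text) pairs.
fn naive_counts<'a>(text: &str, aliases: &[(&'a str, &'a str)]) -> Vec<(&'a str, usize)> {
    aliases
        .iter()
        .map(|&(variant, alias)| (variant, text.matches(alias).count()))
        .collect()
}
```

Called with the response above and the pairs `("A", "car")` and `("B", "car-2")`, this gives A two votes and B one, because the "car" inside "car-2" is counted for A as well.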

Fix: https://github.com/BoundaryML/baml/blob/cd6b141020ec8dfd2514c82ffffaebc5678a025b/engine/baml-lib/jsonish/src/deserializer/coercer/match_string.rs

Change string_match_strategy so that when one match is a substring of a longer, overlapping match, only the longest one is counted.

hellovai commented 1 month ago

For more context the proposed solution is something like:

  1. take all matches (same as before)
  2. Find all matches that overlap with other matches (new)
  3. Reduce to the most strict set (new)
  4. Run tie breaker (same as before)

In the example:

All matches: car x2, car-2 x1

Reduced to: car x1, car-2 x1

Tie breaker fails to disambiguate.
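The four steps can be sketched roughly as follows. This is an illustration under assumptions, not the actual string_match_strategy: the `Match` struct, the function names, and the rule "drop a match that lies inside a strictly longer match" are all hypothetical choices made for the example, and the tie breaker here is simply "strictly highest count wins".

```rust
/// One occurrence of an alias in the response: byte range plus the variant it names.
/// (Hypothetical type, not taken from the BAML code base.)
#[derive(Debug, Clone, PartialEq)]
struct Match<'a> {
    start: usize,
    end: usize,
    variant: &'a str,
}

/// Step 1: collect every occurrence of every alias.
fn all_matches<'a>(text: &str, aliases: &[(&'a str, &'a str)]) -> Vec<Match<'a>> {
    let mut found = Vec::new();
    for &(variant, alias) in aliases {
        for (start, hit) in text.match_indices(alias) {
            found.push(Match { start, end: start + hit.len(), variant });
        }
    }
    found
}

/// Steps 2 and 3: drop any match that lies inside a strictly longer match.
fn reduce_overlaps<'a>(matches: &[Match<'a>]) -> Vec<Match<'a>> {
    matches
        .iter()
        .filter(|m| {
            !matches.iter().any(|other| {
                other.start <= m.start
                    && m.end <= other.end
                    && (other.end - other.start) > (m.end - m.start)
            })
        })
        .cloned()
        .collect()
}

/// Step 4: count per variant and return a winner only if it is strictly ahead.
fn pick_variant<'a>(text: &str, aliases: &[(&'a str, &'a str)]) -> Option<&'a str> {
    let kept = reduce_overlaps(&all_matches(text, aliases));
    let mut counts: Vec<(&'a str, usize)> = Vec::new();
    for m in &kept {
        match counts.iter_mut().find(|(v, _)| *v == m.variant) {
            Some(entry) => entry.1 += 1,
            None => counts.push((m.variant, 1)),
        }
    }
    let best = counts.iter().map(|&(_, n)| n).max()?;
    let mut winners = counts.iter().filter(|&&(_, n)| n == best);
    let first = winners.next()?;
    // A tie means the response is ambiguous, so no variant is chosen.
    if winners.next().is_some() { None } else { Some(first.0) }
}
```

For the response in this issue, the "car" inside "car-2" is dropped, both variants end up with one match, and `pick_variant` returns `None` instead of picking Foo.A. A response that only mentions "car-2" resolves to B.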

hellovai commented 1 month ago

Added a branch with a unit test: https://github.com/BoundaryML/baml/pull/1088

Anyone is free to work on it from here:

cd $REPO/engine/baml-lib/jsonish/src
RUST_LOG=trace cargo test test_numerical_enum