sxlijin commented 3 weeks ago

JSON Schema has a lot of validation logic that BAML can't express; cf. https://build.fhir.org/patient.schema.json.html

accupham commented 3 weeks ago

@sxlijin Hi Sam, I am planning on trying this out first thing in the morning and will give feedback.

accupham commented 3 weeks ago

So some thoughts on this:

Every industry has domain-specific "string-like" or "number-like" fields that have formal constraints on them. At the very least, people have an intuition about whether a value is valid or not. 90% of the time you can capture this with regex and numeric ranges.
In the real world, strings are not just strings!

1. Build a canonical pydantic model (C_py) of what we want to extract in Python code. We can apply custom validators to the fields the way we are used to: pattern (regex) validators, min/max length constraints, and so on. Field descriptions end up in the LLM prompt.
```py
# (C) py
import pydantic

class PatientVisit(pydantic.BaseModel):
    first_name: str
    last_name: str
    mrn: str = pydantic.Field(pattern=r'\d\d-\d\d-\d\d-\d', description="medical record number")
```
2. Serialize the pydantic model (C_py) into JSON schema (C_json), build our dynamic BAML pydantic model (B_py) from it, and extract into an instance of B_py.
```py
# (C) json
json_schema = PatientVisit.model_json_schema()
# (B) baml
tb = TypeBuilder()
tb.unstable_features.add_json_schema(json_schema)
# (B) py
baml_patient = await b.ExtractPatient(
    "My name is John Doe. My medical record number is 12-34-56-7",
    {"tb": tb},
)
```
3. Cast the instance of B_py into C_py, which applies the custom validations and high-level features of the original pydantic model.
```py
canonical_patient = PatientVisit(**baml_patient.dict())
```

Essentially, each representation in the chain C_py ⇒ C_json ⇒ B_baml ⇒ B_prompt ⇒ B_py ⇒ C_py is structurally isomorphic to the others. The benefit is that each representation has unique advantages in its operating domain:

C_py (Canonical Pydantic model):
- Excellent for validation
C_json (JSON Schema):
- Language-agnostic representation
- Easily shareable and consumable by various tools
B_baml (BAML representation):
- Optimized for LLM interactions
- Potentially includes LLM-specific formatting or prompting techniques
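
To make the C_json step concrete, here is a minimal sketch of how a JSON Schema fragment can carry validation logic alongside the data shape. The schema is hand-written for illustration and is not the exact output of `model_json_schema()`. `check` is a hypothetical helper that understands only the `required`, `pattern`, `minimum`, and `maximum` keywords, and the `age` field with its bounds is an assumption.

```python
import re

# Hand-written JSON Schema fragment (assumption: illustrative only).
# The pattern is anchored here so that surrounding characters are rejected.
PATIENT_SCHEMA = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "mrn": {"type": "string", "pattern": r"^\d\d-\d\d-\d\d-\d$"},
        "age": {"type": "integer", "minimum": 0, "maximum": 200},
    },
    "required": ["first_name", "last_name", "mrn"],
}


def check(instance: dict, schema: dict) -> list[str]:
    """Return a list of human-readable violations (empty if none)."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"{name}: missing")
    for name, rules in schema["properties"].items():
        if name not in instance:
            continue
        value = instance[name]
        # JSON Schema patterns are not implicitly anchored, hence re.search.
        if "pattern" in rules and not re.search(rules["pattern"], value):
            errors.append(f"{name}: does not match {rules['pattern']}")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: above {rules['maximum']}")
    return errors
```

Because the constraints live in plain data, any language with a regex engine can enforce them, which is what makes this representation easy to share between tools.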

However this is a really roundabout way of hacking in missing functionality. The JSON schema part isn't strictly necessary; it just happens to capture validation logic alongside the data schema. Perhaps there's a more elegant way of approaching this in the BAML language? Something like this hypothetical example?

```
type MRN string @pattern("\d\d-\d\d-\d\d-\d")
type Age int @range(0, 200)

class PatientVisit {
  first_name string
  last_name string
  mrn MRN
  age Age
}
```

Another advantage of type aliasing in BAML is the ability to match on the generated pydantic type alias (`MRN` or `Age`) and attach custom post-extraction validators for checks that can't be captured with regex, such as external lookups via third-party API calls.

```py
def validate_mrn(v):
    is_valid = api_lookup(v)
    return is_valid

...

if field.type_ == MRN and isinstance(v, str):
    return validate_mrn(v)
```

I think we can currently do this by naming BAML fields consistently, using name aliasing, and then string-matching on the non-alias name. However, we are trying to avoid string-matching type names and hacking in post-hoc validation logic, which would make the codebase messy and unmaintainable, especially as the number of field types grows.
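
A minimal sketch of the type-based dispatch described above, assuming the generated Python types expose `MRN` and `Age` as distinct aliases. Everything here is a stand-in: `typing.NewType` and a plain class replace the BAML-generated pydantic model, and `api_lookup` is a stub for the external call.

```python
from typing import Callable, Dict, List, NewType, get_type_hints

# Stand-ins for aliases that BAML would hypothetically generate.
MRN = NewType("MRN", str)
Age = NewType("Age", int)


class PatientVisit:
    """Plain class standing in for the generated model."""
    first_name: str
    last_name: str
    mrn: MRN
    age: Age


def api_lookup(mrn: str) -> bool:
    """Stub for a third-party lookup; a real one would call an API."""
    return mrn in {"12-34-56-7"}


# Validators are keyed by type, not by field name.
VALIDATORS: Dict[object, Callable[[object], bool]] = {
    MRN: api_lookup,
    Age: lambda v: 0 <= v <= 200,
}


def failed_fields(model: type, data: dict) -> List[str]:
    """Return names of fields whose type-level validator rejects the value."""
    failed = []
    for name, hint in get_type_hints(model).items():
        validator = VALIDATORS.get(hint)
        if validator is not None and not validator(data[name]):
            failed.append(name)
    return failed
```

Adding a second field of type `MRN` would need no new validation code, which is the point of keying on the type rather than on the field name.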

hellovai commented 3 weeks ago

I quite like this idea. It addresses one of the biggest concerns I had around validation, which was how to deal with unions and how to allow validations within unions. The answer in your proposal is a very elegant one: a validated type is inherently a new type, so you must indicate it as such.
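
One way to read this, as a hypothetical sketch and not a description of how BAML resolves unions: if each validated type is a named type with its own constraint, a union member can be chosen by running the members' constraints against the value. `Phone`, its pattern, and `resolve` are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical named validated types: name -> pattern the value must fully match.
VALIDATED_TYPES = {
    "MRN": r"\d\d-\d\d-\d\d-\d",
    "Phone": r"\d{3}-\d{3}-\d{4}",
}


def resolve(value: str, union: list[str]) -> Optional[str]:
    """Return the first union member whose pattern accepts the value."""
    for type_name in union:
        if re.fullmatch(VALIDATED_TYPES[type_name], value):
            return type_name
    return None
```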

We can then later have the flexibility of allowing inline validations as well for convenience, but we wouldn't give any type_ specifier on the object.

```
class PatientVisit {
  first_name string
  last_name string
  mrn string @pattern("\d\d-\d\d-\d\d-\d")
  age Age
}
```

There are still a few open questions in my mind here, such as how we would allow for constraints like the length of a list or only one of many keys, what the actual types would be in multiple languages, etc. But this new proposal is definitely worth a deeper look. I'll take a stab at writing a formal BAML language spec for this over the weekend, with a list of possible concerns and a more fleshed-out implementation.

But your well-thought-out design has inspired us to pick this up sooner than we anticipated!

hellovai commented 2 weeks ago

This feature is now being discussed here: https://github.com/orgs/BoundaryML/discussions/786

Please follow the discussion for the latest; this issue will be updated when the document is ready to move on to the next stage.

BoundaryML / baml

BAML support for custom field/type validation logic #765
