Open sxlijin opened 3 weeks ago
@sxlijin Hi Sam, I am planning on trying this out first thing in the morning and will give feedback.
So some thoughts on this:
```py
# (C) py
class PatientVisit(pydantic.BaseModel):
    first_name: str
    last_name: str
    mrn: str = pydantic.Field(pattern=r'\d\d-\d\d-\d\d-\d', description="medical record number")

# (C) json
json_schema = PatientVisit.model_json_schema()

# (B) baml
tb = TypeBuilder()
tb.unstable_features.add_json_schema(json_schema)
baml_patient = await b.ExtractPatient(
    "My name is John Doe. My medical record number is 12-34-56-7",
    {"tb": tb},
)
```
3. Cast the instance of B<sub>py</sub> into C<sub>py</sub>, which will apply the custom validations and high level features of the original pydantic model.
```py
# model_dump() is the pydantic v2 spelling of the deprecated .dict()
canonical_patient = PatientVisit(**baml_patient.model_dump())
```
Essentially, every step in the chain C<sub>py</sub> ⇒ C<sub>json</sub> ⇒ B<sub>baml</sub> ⇒ B<sub>prompt</sub> ⇒ B<sub>py</sub> ⇒ C<sub>py</sub> is a structure-preserving (isomorphic) transformation. The advantage is that each representation has unique strengths in its operating domain:
C<sub>py</sub> (Canonical Pydantic model):
C<sub>json</sub> (JSON Schema):
B<sub>baml</sub> (BAML representation):
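As a concrete illustration of the C<sub>json</sub> step, here is a minimal stdlib-only sketch of how a JSON Schema carries the `pattern` rule alongside the field structure. The `schema` dict is hand-written to approximate what `PatientVisit.model_json_schema()` might emit (the exact output is an assumption), and `pattern_violations` is a hypothetical helper, not part of pydantic or BAML.

```python
import re

# Hand-written stand-in for the schema of PatientVisit (assumed shape).
schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "mrn": {
            "type": "string",
            "pattern": r"\d\d-\d\d-\d\d-\d",
            "description": "medical record number",
        },
    },
    "required": ["first_name", "last_name", "mrn"],
}


def pattern_violations(schema, data):
    """Return the names of string fields that fail their `pattern` keyword."""
    failed = []
    for name, spec in schema["properties"].items():
        pattern = spec.get("pattern")
        value = data.get(name)
        # JSON Schema patterns are not implicitly anchored, hence re.search.
        if pattern and isinstance(value, str) and re.search(pattern, value) is None:
            failed.append(name)
    return failed


good = {"first_name": "John", "last_name": "Doe", "mrn": "12-34-56-7"}
bad = dict(good, mrn="1234567")
print(pattern_violations(schema, good))  # []
print(pattern_violations(schema, bad))   # ['mrn']
```

Because the pattern is unanchored, a value such as `x12-34-56-78` would also pass this check; wrapping the expression in `^...$` would be needed if an exact match is intended.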
However this is a really roundabout way of hacking in missing functionality. The JSON schema part isn't strictly necessary; it just happens to capture validation logic alongside the data schema. Perhaps there's a more elegant way of approaching this in the BAML language? Something like this hypothetical example?
```baml
type MRN string @pattern("\d\d-\d\d-\d\d-\d")
type Age int @range(0, 200)

class PatientVisit {
  first_name string
  last_name string
  mrn MRN
  age Age
}
```
Another advantage of type aliasing in BAML is the ability to match on the generated pydantic type alias (`MRN` or `Age`) after BAML generation, and to attach custom post-extraction validators that can't be captured with a regex, such as external lookups via third-party API calls.
```py
def validate_mrn(v):
    is_valid = api_lookup(v)
    return is_valid

...

if field.type_ == MRN and isinstance(v, str):
    return validate_mrn(v)
```
I think we can currently do this by naming BAML fields consistently, aliasing the names, and then string-matching on the non-alias name. However, we are trying to avoid string-matching type names and hacking in post-hoc validation logic, which would make the codebase messy and unmaintainable, especially as the number of field types grows.
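A rough sketch of what type-based dispatch could look like on the Python side, assuming the generated code exposed distinct alias types. Everything here is illustrative: `typing.NewType` and a dataclass stand in for whatever BAML would actually generate, `KNOWN_MRNS` replaces the external `api_lookup` call with an in-memory set, and `failed_fields` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import NewType, get_type_hints

# Stand-ins for generated alias types (assumption, not real BAML output).
MRN = NewType("MRN", str)
Age = NewType("Age", int)


@dataclass
class PatientVisit:
    first_name: str
    last_name: str
    mrn: MRN
    age: Age


KNOWN_MRNS = {"12-34-56-7"}  # in-memory stand-in for a third-party lookup


def validate_mrn(value):
    return value in KNOWN_MRNS


def validate_age(value):
    return 0 <= value <= 200


# Validators are keyed by type, so no field names are string-matched.
VALIDATORS = {MRN: validate_mrn, Age: validate_age}


def failed_fields(obj):
    """Return names of fields whose declared type has a failing validator."""
    hints = get_type_hints(type(obj))
    return [
        name
        for name, declared in hints.items()
        if declared in VALIDATORS and not VALIDATORS[declared](getattr(obj, name))
    ]


ok = PatientVisit("John", "Doe", MRN("12-34-56-7"), Age(42))
broken = PatientVisit("John", "Doe", MRN("00-00-00-0"), Age(250))
print(failed_fields(ok))      # []
print(failed_fields(broken))  # ['mrn', 'age']
```

Adding a new validated type then means adding one entry to `VALIDATORS`, regardless of how many fields use that type or what they are named.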
I quite like this idea. It addresses one of the biggest concerns I had around validation, which was how to deal with unions and how to allow validations within unions. The answer in your proposal is a very elegant one: a validated type is inherently a new type, so you must declare it as such.
We would then have the flexibility to allow inline validations later for convenience, but we wouldn't give any `type_` specifier on the object.
```baml
class PatientVisit {
  first_name string
  last_name string
  mrn string @pattern("\d\d-\d\d-\d\d-\d")
  age Age
}
```
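To illustrate why treating a validated type as its own named type helps with unions, here is a small sketch in Python. It assumes a union would be resolved by trying each member's validation in order; that resolution strategy, the `ValidatedType` class, the `resolve_union` helper, and the `SSN` type are all hypothetical and not part of BAML.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedType:
    """A named type whose identity includes its validation rule."""
    name: str
    pattern: str

    def accepts(self, raw: str) -> bool:
        return re.fullmatch(self.pattern, raw) is not None


MRN = ValidatedType("MRN", r"\d\d-\d\d-\d\d-\d")
SSN = ValidatedType("SSN", r"\d{3}-\d{2}-\d{4}")  # hypothetical second member


def resolve_union(raw, members):
    """Return the name of the first union member that accepts `raw`, else None."""
    for member in members:
        if member.accepts(raw):
            return member.name
    return None


print(resolve_union("12-34-56-7", [MRN, SSN]))   # MRN
print(resolve_union("123-45-6789", [MRN, SSN]))  # SSN
print(resolve_union("hello", [MRN, SSN]))        # None
```

Since each member has a name, the result can report which variant matched, which an anonymous inline validation on a plain `string` could not do.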
There are still a few open questions in my mind here, such as how we would allow for constraints like the length of a list or only one of many keys, and what the actual types would be in multiple languages. But this new proposal is definitely worth a deeper look. I'll take a stab at writing a formal BAML language spec for this over the weekend, with a list of possible concerns and a more fleshed-out implementation.
But your well-thought-out design has inspired us to pick this up sooner than we anticipated!
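For reference, the two constraint kinds named as open questions (list length and "only one of many keys") can be stated as plain Python predicates. This is only a sketch of what such checks mean, not a proposal for BAML syntax; the function names and the `email`/`phone` keys are made up for illustration.

```python
def length_between(items, min_len, max_len):
    """True if the list has between min_len and max_len items, inclusive."""
    return min_len <= len(items) <= max_len


def exactly_one_of(record, keys):
    """True if exactly one of `keys` is present in `record` with a non-None value."""
    return sum(1 for key in keys if record.get(key) is not None) == 1


print(length_between(["a", "b"], 1, 3))                                    # True
print(exactly_one_of({"email": "x@example.com"}, ["email", "phone"]))      # True
print(exactly_one_of({"email": "x", "phone": "555"}, ["email", "phone"]))  # False
```

Note that the second check spans several fields at once, so it does not attach naturally to a single aliased type the way `@pattern` or `@range` would.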
This feature is now being discussed here: https://github.com/orgs/BoundaryML/discussions/786
Please follow the discussion for the latest; this issue will be updated when the document is ready to move on to the next stage.
JSON Schema has a lot of validation logic that BAML can't express; see, for example, https://build.fhir.org/patient.schema.json.html