dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

Use discriminated unions to improve validation errors #244

Open mvandenburgh opened 2 weeks ago

mvandenburgh commented 2 weeks ago

We recently saw metadata validation errors on a dandiset in production (https://github.com/dandi/dandi-archive/issues/1958). These are the errors that were reported:

contributor: String should match pattern '^([\w\s\-\.']+),\s+([\w\s\-\.']+)$'
contributor: Input should be 'Organization'
contributor: String should match pattern 'https://ror.org/[a-z0-9]+$'

The invalid contributor in question turned out to be a Person with an invalid name field. In other words, the first validation error was the actual issue, while the other two were irrelevant and somewhat misleading. What's happening is that pydantic cannot tell, from a validation perspective, whether the object is intended to be a Person or an Organization, because contributor is of type List[Union[Person, Organization]]. So it checks both cases: first it validates the object as if it were a Person and gets the first error, then it validates it as an Organization and gets the other two errors.
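For illustration, here is a minimal sketch of how a plain union produces this kind of output. It assumes pydantic v2 and uses simplified stand-in models, not the real dandischema classes; the fields, the required identifier on Organization, and the sample input are all assumptions made for the example.

```python
from typing import List, Literal, Union

from pydantic import BaseModel, Field, ValidationError


# Hypothetical, stripped-down stand-ins for the dandischema models.
class Person(BaseModel):
    schemaKey: Literal["Person"] = "Person"
    name: str = Field(pattern=r"^([\w\s\-\.']+),\s+([\w\s\-\.']+)$")


class Organization(BaseModel):
    schemaKey: Literal["Organization"] = "Organization"
    name: str
    identifier: str = Field(pattern=r"https://ror.org/[a-z0-9]+$")


class Dandiset(BaseModel):
    # Plain union: pydantic has to try every member.
    contributor: List[Union[Person, Organization]]


# A Person whose name is missing the "Last, First" comma (made-up input).
bad = {
    "schemaKey": "Person",
    "name": "Jane Doe",
    "identifier": "0000-0002-1825-0097",
}

errors = []
try:
    Dandiset(contributor=[bad])
except ValidationError as exc:
    errors = exc.errors()
    for err in errors:
        # One error comes from the Person branch, two from Organization.
        print(err["loc"], err["msg"])
```

Only the name error reflects a real problem with the input; the schemaKey and identifier errors exist only because the object was also tried as an Organization.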

I propose that we use discriminated unions on the schemaKey field of each pydantic model to avoid this in the future. This would allow pydantic to narrow validation to the specific type of the object based on its schemaKey. Had this been in place in the scenario above, pydantic would have recognized that the invalid contributor was supposed to be a Person and would not have reported the additional misleading validation errors that assume it's an Organization.
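A rough sketch of what this could look like, assuming pydantic v2 and the same kind of simplified stand-in models (not the actual dandischema definitions). The `Contributor` alias and the sample input are hypothetical names introduced for the example.

```python
from typing import Annotated, List, Literal, Union

from pydantic import BaseModel, Field, ValidationError


# Hypothetical, stripped-down stand-ins for the dandischema models.
class Person(BaseModel):
    schemaKey: Literal["Person"] = "Person"
    name: str = Field(pattern=r"^([\w\s\-\.']+),\s+([\w\s\-\.']+)$")


class Organization(BaseModel):
    schemaKey: Literal["Organization"] = "Organization"
    name: str
    identifier: str = Field(pattern=r"https://ror.org/[a-z0-9]+$")


# The discriminator tells pydantic to pick the model from schemaKey
# instead of trying every member of the union.
Contributor = Annotated[
    Union[Person, Organization], Field(discriminator="schemaKey")
]


class Dandiset(BaseModel):
    contributor: List[Contributor]


# Same kind of invalid Person as in the reported case (made-up input).
bad = {
    "schemaKey": "Person",
    "name": "Jane Doe",
    "identifier": "0000-0002-1825-0097",
}

errors = []
try:
    Dandiset(contributor=[bad])
except ValidationError as exc:
    errors = exc.errors()
    for err in errors:
        # Only the Person branch is validated, so only the name error remains.
        print(err["loc"], err["msg"])
```

One tradeoff to be aware of: with a discriminator, the input has to carry schemaKey explicitly. If it is missing, pydantic reports that it cannot determine the tag rather than falling back to the field's default.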