Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic

dandi / dandi-schema

Schemata for DANDI archive project

Apache License 2.0

7 stars 10 forks source link

Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic #205

Open jwodder opened 11 months ago

jwodder commented 11 months ago

(This is an accumulation of various things discussed elsewhere which need to be written down.)

Currently, it is possible for users of the Dandi Archive API to submit asset & Dandiset metadata that does not fully conform to the relevant dandischema model, and the Archive will accept, store, and return such flawed metadata, largely via use of DandiBaseModel.unvalidated (being replaced by Pydantic's construct in #203). I believe part of the motivation for this is so that web users can fill in metadata over multiple sessions without having to fill in every field in a single sitting.

This results in API requests for asset & Dandiset metadata sometimes returning values that do not validate under dandischema's models; in particular, if a user of dandi-cli's Python API calls get_metadata() instead of get_raw_metadata(), the call may fail because our API returned metadata that doesn't conform to our own models (See dandi/dandi-cli#1205 and dandi/dandi-cli#1363).

The dandischema models should therefore be overhauled as follows:

There should exist models for representing Dandiset & asset metadata in a draft/unfinished state. These models should accept all inputs that we want to accept from users (both via the API and the web UI), store in the database, and return in API responses. (It is likely that such models will have all of their fields marked optional aside from the absolute bare minimum required.)
- The get_metadata() methods of dandi-cli's Python API should return instances of these models.
There should exist functionality for determining whether an instance of a draft/unfinished model meets all of the requirements for Dandiset publication.
- One possible way to implement this would be to have separate models for published metadata that inherit from the draft models and make various fields non-optional.

CC @satra @dandi/dandiarchive @dandi/dandi-cli

yarikoptic commented 11 months ago

I thought to argue that it would add another combinatorial dimension, but well, we kinda already have it half way that in terms of separation of "Common" vs "Publishable"

dandischema/models.py:class CommonModel(DandiBaseModel):
dandischema/models.py:class Dandiset(CommonModel):
dandischema/models.py:class BareAsset(CommonModel):
dandischema/models.py:class Asset(BareAsset):
dandischema/models.py:class Publishable(DandiBaseModel):
dandischema/models.py:class PublishedDandiset(Dandiset, Publishable):
dandischema/models.py:class PublishedAsset(Asset, Publishable):

the question then is really to "shift" validation into something like .is_publishable() and may be indeed mirroring as PublishableAsset/PublishableDandiset counterparts with more restrictive model?

sneakers-the-rat commented 11 months ago

butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates if validations fail by modifying a status field that indicates whether it is publishable or not.


@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls, 
    v: T, 
    handler: ValidatorFunctionWrapHandler, 
    info: ValidationInfo
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation but still allow class instantiation, eg. 
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors that forbid us from being publishable

You could also dynamically control whether one should be validating publishable or draft standards using validator context like eg: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization

jwodder commented 9 months ago

@yarikoptic I don't think I'm the right person to assign this to (at least not at this time). The task requires someone (or discussions with several someones) who knows what the "minimal metadata" requirements are, what the "completed metadata" requirements are, and what's allowed for the in-between.

yarikoptic commented 9 months ago

gotcha -- thanks. Let us dwell on this aspect -- may be during upcoming meetup with @satra at all.

yarikoptic commented 6 months ago

I have been thinking about this but still not sure if we ever would be able to easily guarantee that even published Dandiset's metadata confirms "The Model". The reason is the same what haunts NWB (and PyNWB in particular ATM): model versions and the fact that we can and do break backward compatibility (e.g. #235 would make possibly legit prior models now invalid).

So, unless we can come up with a way to have a model "Valid" according to a specific version (which we cannot at pydantic level since we have only 1 "current" model version), we cannot guarantee that past "Valid" models remain valid currently.

sneakers-the-rat commented 6 months ago

Seems like we would need model migration here, and that might be a good thing to have generally, no schema stays still ;).

I interject again here bc i have been wanting to do something similar and think it might be a nice little pydantic extension - Decorator/module level const gives model a particular version, each upgrade would need to provide a migration patch that can do up and down. When validating, run through each migration step and then report which versions failed - eg. if an optional param became required, the up method would check for that and fail otherwise.

I think that might be nicer than maintaining full copies of every version without clean ways to migrate between them. pydantic should allow model instantiation through validation errors in any case, and the migration methods would get us 'best effort' upgrades (eg. be able to perform renames and etc. but not magically fill in missing required values).

If the plan is to move to linkml eventually, this would be something i would be happy to work on with y'all in the pydantic generator, i have been doing bigg refactor work there and am in the process of finishing a patch to include all schema metadata, so if we want to make some helper code to do diffs between schema versions and use that to generate model migration code i would be super into working on.

edit: fwiw i have this mostly done for NWB in linkml by being able to pull versions of the schema and generate models on the fly - eg see the git provider and the schema provider where one just does PydanticProvider().get('core', '2.3.0') and gets back a module of models without needing to install a different version of the tool. that was written before i started refactoring linkml's pydanticgen so a lot of the bad parts in there can be replaced.

satra commented 6 months ago

thanks @sneakers-the-rat - i think this may align well with the transition to linkml. i know @candleindark is working on an updated reverse linkml generator from the current pydantic models in dandischema. i suspect this will be a priority for us to redo the models using linkml in the upcoming month. we will also want to separate out schema validation into two components: 1) are appropriate values being assigned and 2) are required values being assigned. requirement is a project specific thing and hence this will also allow us to reuse schemas. also 2 allows us to further stratify requirement given the state of an asset (pre-upload, uploaded, modified, published) or dandiset.

we need a specific project timeline/design for this transition.

sneakers-the-rat commented 6 months ago

Lmk how I can help - happy to hack on the pydantic generator to make it work to fit yalls needs bc working with DANDI + NWB in linkml is exactly within my short term plans :)

candleindark commented 6 months ago

@sneakers-the-rat My current plan is to improvement Pydantic to linkml generator as I participate in the transition to linkml in dandischema.

One approach is to build something that mimics the behavior of GenerateJsonSchema but instead of generating JSON schema, linkml schema will be generated. If this approach works, the parsing of the Pydantic models are already taken care of by the Pydantic library.

sneakers-the-rat commented 6 months ago

That would be great! There is already a sort of odd LinkMLGenerator but generalizing that out to accept pydantic models/json schema from them would be a very useful think for initial model import. LinkML definitely wants to work with linkml schema being the primary source of truth, aka do the pydantic model -> linkml schema conversion and from then on use the pydantic models generated from the linkml schema, but I have been working on the generator to make it easier to customize for specific needs eg. If yall want to separate some validation logic in a special way that the default generator doesnt do.

I overhauled the templating system recently to that end, see: https://linkml.io/linkml/generators/pydantic.html#templates And https://linkml.io/linkml/generators/pydantic.html#cmdoption-gen-pydantic-template-dir Programmatic example: https://github.com/linkml/linkml/blob/bd989b3a8d7fd336532a35a4019fca8877e915d6/tests/test_generators/test_pydanticgen.py#L956 CLI example: https://github.com/linkml/linkml/blob/bd989b3a8d7fd336532a35a4019fca8877e915d6/tests/test_scripts/test_gen_pydantic.py#L8

And im also working on making pydantic models fully invertible to linkml schema, so you could go DANDI models -> linkml schema -> linkml pydantic models -> customizef linkml DANDI models and then be able to generate the schema in reverse from those, but it might be more cumbersome to maintain than just customizing the templates and having the schema be the source of truth. See: https://github.com/linkml/linkml/pull/2036

that way you can also do stuff that cant be supported in pydantic like all the RDF stuff (but im working on that next ) ;)

sneakers-the-rat commented 6 months ago

(Maybe we should make a separate issue for linkml discussions, sorry to derail this one)

candleindark commented 5 months ago

butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates if validations fail by modifying a status field that indicates whether it is publishable or not.

like
@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls, 
    v: T, 
    handler: ValidatorFunctionWrapHandler, 
    info: ValidationInfo
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation but still allow class instantiation, eg. 
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors that forbid us from being publishable
You could also dynamically control whether one should be validating publishable or draft standards using validator context like eg: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization

@sneakers-the-rat Thanks for the input. I think this is a solution to make field optional when receiving the user inputs while keeping the fields required at publication. However, one can't generate, at least not directly, two schema variants (one with the fields optional and the other with the fields required). It would be nice if we can the two different schema variants. Is SchemaView a potential solution to getting two schema variants?

sneakers-the-rat commented 5 months ago

Do you mean at a field level, being able to label a given slot as "toggle required" so at generation time you get two pydantic modules, one with those as required and one as optional? Or do you mean at a generator level, make two pydantic modules, one where all slots are required and one where all are optional?

Im assuming the former, where you want to make a subset of slots that are annotated as being part of a publishable criterion, and some computed slot is True if all those are met? I still am sorta thinking that generating two sets of modules would be a pain to maintain and use, and would personally lean towards a single set of models that can indicate what level of compliance the data is vs. Having several sets of nearly-identical models.

Another approach if ya still want to make multiple sets of models might be to do something like have one base model thats the most lax, then have a Publishable model inherit from that and use slot_usage to change the requiredness of some slots, and then you have a container class with a dataset slot with a any_of: [{range: DraftDataset}, {range: PublishableDataset}] and then pydantic validators would cast to the correct type based on what was present (probably would want to order them from most to least restrictive, but you know what I mean). That still sounds sort of hard to maintain and requires some additional container class, but might be a middle ground between one set of models and totally independent sets of models

sneakers-the-rat commented 5 months ago

Sorry to answer your question, I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requireness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor bc at the moment its a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV)

candleindark commented 5 months ago

Do you mean at a field level, being able to label a given slot as "toggle required" so at generation time you get two pydantic modules, one with those as required and one as optional? Or do you mean at a generator level, make two pydantic modules, one where all slots are required and one where all are optional?

I am not clear about this, but this can be that I have never tried generating two modules from one schema.

Sorry to answer your question, I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requireness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor bc at the moment its a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV)

Yes, this is close to what I had in mind. Have one LinkML schema, the base schema, as the source of truth. From the base schema, we generate variants from it by toggling the required slots for example. Have all these variants published as a documentation as well so that users can see what are needed at different stages of their metadata. From the different variants of the LinkML schema, we can generate different Python modules containing the Pydantic models. The name of a Pydantic model will stay the same across the different Python modules.

sneakers-the-rat commented 5 months ago

for that, probably the easiest way would be to use subsets, like

subsets:
  DraftSubset:
    rank: 0
    title: Draft Subset
    description: Set of slots that must be present for draft datasets
  PublishableSubset:
    rank: 1
    title: Publishable Subset
    description: Set of slots that must be present for publishable datasets

slots:
  optional_slot:
    required: false
    description: Only required for publishable datasets
    in_subset:
      - PublishableSubset
  required_slot:
    required: true
    description: Required on all datasets
    in_subset:
      - DraftSubset
      - PublishableSubset

and then you could either iter through slots and check the subsets they are in or use the get_subset method on SchemaView, modify the requiredness accordingly, then pass back to a new sv instance.

I still think that you can do this with one set of models! I am just imagining it being difficult to parse and also to write code for "ok now i import the DraftDataset model but when it comes time to publish i'll need to use PublishableDataset" but ultimately your call. Slight semantic difference between "these fields are optional for drafts but required for publishable datasets" to "these fields are optional, but the dataset is not publishable if they are absent."

we would just need to improve the pydantic generator to support equals_expression (which should happen anyway).

So then you would do something like

classes:
  Dataset
    attributes:
      attr1: "..."
      attr2: "..."
      publishable:
        equals_expression: "{attr1} and {attr2}"

which would make a model like

class Dataset(BaseModel):
    attr1: str
    attr2: str

    @computed_field
    def publishable(self) -> bool:
        return (self.attr1 is not None) and (self.attr2 is not None)

or we could combine the approaches and make an extension to the metamodel so we get the best of both - clear metadata on the slots and also single model.

# ...
equals_expression: "all(x, PublishableSubset, x)"

where the first x is the logical condition (in this case, just being truthy) and the second x is the variable to use to refer to the items in the subset in the logical condition (so eg you could also do all(x > 5, PublishableSubset, x)

I think this is probably a common enough need that it would be worth making a way to express this neatly in linkml, so we can probably make the metamodel/generators come to you as well as y'all hacking around the metamodel/generators :).

So to keep a running summary of possible implementations:

multiple schemas
use rules
use equals_expression
modify the pydanticgen templates
subsets with some hackery at generation time
make some notion of 'context' when generating models that flips behavior
multiple classes + container class that switches between class depending on presence/absence of values