dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

Update to supporting Pydantic V2 #203

Closed candleindark closed 5 months ago

candleindark commented 7 months ago

This PR migrates Pydantic from V1 to V2. It closes #176.

This PR deprecates the following APIs:

  1. dandischema.models.DandiBaseModel.json_dict
    1. It can now be replaced by pydantic.BaseModel.model_dump(mode='json', exclude_none=True)
  2. dandischema.models.DandiBaseModel.unvalidated
    1. It can be replaced with pydantic.BaseModel.model_construct
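For illustration, both replacements can be exercised on a minimal model (the `Asset` model below is a hypothetical stand-in, not the one defined in dandischema):

```python
from typing import Optional

from pydantic import BaseModel


class Asset(BaseModel):  # hypothetical stand-in for a dandischema model
    path: str
    description: Optional[str] = None


a = Asset(path="sub-01/data.nwb")

# Replacement for json_dict: serialize to JSON-compatible types, dropping None fields
assert a.model_dump(mode="json", exclude_none=True) == {"path": "sub-01/data.nwb"}

# Replacement for unvalidated: construct an instance without running validation
b = Asset.model_construct(path=123)  # no error is raised despite the wrong type
assert b.path == 123
```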

Additional notes regarding changes entailed by Pydantic V2:

  1. When using dandischema.models.CommonModel.repository in APIs which expect a str, use str() to convert the attribute to a str.
    1. For more information, visit Url and Dsn types in pydantic.networks no longer inherit from str.
  2. Pydantic V2 generates a slightly different JSON schema for the same Pydantic model compared to Pydantic V1. In particular, the JSON schemas for Optional fields now indicate that the value null is allowed. Because this change in behavior is incompatible with required tests for dandi-schema, dandischema.utils.TransitionalGenerateJsonSchema has been defined to revert the behavior of the JSON schema generation for Optional fields back to the behavior exhibited by Pydantic V1.
    1. To generate the JSON schema for a model without the indication of the null value as for Pydantic V1, initiate the generation with Model.model_json_schema(schema_generator=TransitionalGenerateJsonSchema).
    2. For more information about the use of a GenerateJsonSchema class, please check out these examples.
  3. Other changes in JSON schema generation in Pydantic V2 in comparison to Pydantic V1 that do not break current required tests for dandi-schema are the following:
    1. "definitions" is now "$defs"
    2. "default" is added for fields that have a default value
    3. "acknowledgement" in CommonModel now has "description" instead of "descriptions" due to a typo correction I put in. (This is not a change entailed by Pydantic V2 but from a correction I put in the code.)
    4. "maxLength" is no longer included in the JSON schema for a field of the type AnyHttpUrl. This is possibly due to the fact that all URL types in Pydantic V2 no longer inherit from str and there is no max length restriction.
    5. For a field of a Literal type of a single value, the "type" key is no longer specified in the corresponding JSON schema. For example, the JSON schema for a field of the type Literal["Asset"] no longer contains the "type" key with the value of "string". This change affects fields such as BareAsset.schemaKey.
    6. For an Enum subclass that has str values for corresponding choices, the new JSON schema will include a "type" key with the value of "string". This change affects pretty much all the Enum subclasses defined in this project.
    7. The docstring is no longer inherited from a parent Pydantic model and used as the value for the "description" key in the JSON schema. For example, the JSON schema for Project no longer has the "description" key because its parent, Activity, has a docstring, """Information about the Project activity""".
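A schema generator along the lines of dandischema.utils.TransitionalGenerateJsonSchema can be sketched as follows. This is a minimal, hypothetical version for illustration (the actual implementation in dandischema may differ); it overrides `nullable_schema` so that Optional fields are emitted V1-style, without the "null" variant:

```python
from typing import Optional

from pydantic import BaseModel
from pydantic.json_schema import GenerateJsonSchema, JsonSchemaValue
from pydantic_core import core_schema


class TransitionalGenerateJsonSchema(GenerateJsonSchema):
    """Emit Pydantic V1-style JSON schemas for Optional fields."""

    def nullable_schema(self, schema: core_schema.NullableSchema) -> JsonSchemaValue:
        # Generate the schema for the inner (non-null) type only, instead of
        # the V2 default of {"anyOf": [<inner schema>, {"type": "null"}]}
        return self.generate_inner(schema["schema"])


class Example(BaseModel):
    name: Optional[str] = None


s = Example.model_json_schema(schema_generator=TransitionalGenerateJsonSchema)
assert "anyOf" not in s["properties"]["name"]
assert s["properties"]["name"]["type"] == "string"
```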

Note: This PR contains the "minimum" changes needed to support Pydantic V2. Other improvements to the project enabled by Pydantic V2 will come as subsequent PRs.
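As an aside on note 1 above (URL types no longer inheriting from str), the behavior change can be seen directly; this is a small illustration, not dandischema code:

```python
from pydantic import AnyHttpUrl, TypeAdapter

url = TypeAdapter(AnyHttpUrl).validate_python(
    "https://dandiarchive.org/dandiset/000003"
)

# In Pydantic V2 the validated value is a Url object, not a str subclass...
assert not isinstance(url, str)

# ...so convert explicitly where an API expects a str
assert str(url).startswith("https://dandiarchive.org/")
```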

TODOs:

yarikoptic commented 7 months ago

This PR removes the following APIs.

  1. DandiBaseModel.json_dict

    1. Its use can now be replaced by pydantic.BaseModel.model_dump(mode='json', exclude_none=True)
  2. DandiBaseModel.unvalidated

    1. Its use can be replaced with pydantic.BaseModel.model_construct

Even if just for "practice", I think it would be good to provide a proper deprecation cycle for those, in case there happens to be a tool using dandischema directly and not relying on any "pydantic specifics": please keep them in the API, but make them raise DeprecationWarning, and let's schedule to drop them entirely in e.g. 4-6 months after the entire migration to Pydantic 2.0 "settles down".
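A deprecation shim along these lines might look like the sketch below (the simplified `DandiBaseModel` is a stand-in for the real base model; the deprecated method delegates to the Pydantic V2 replacement):

```python
import warnings
from typing import Any, Dict

from pydantic import BaseModel


class DandiBaseModel(BaseModel):  # simplified stand-in for the real base model
    def json_dict(self) -> Dict[str, Any]:
        """Deprecated: use model_dump(mode="json", exclude_none=True) instead."""
        warnings.warn(
            "DandiBaseModel.json_dict is deprecated and will be removed; "
            "use pydantic.BaseModel.model_dump(mode='json', exclude_none=True)",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.model_dump(mode="json", exclude_none=True)


class M(DandiBaseModel):
    x: int = 1


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    d = M().json_dict()

assert d == {"x": 1}
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```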

candleindark commented 7 months ago

This PR removes the following APIs.

  1. DandiBaseModel.json_dict

    1. Its use can now be replaced by pydantic.BaseModel.model_dump(mode='json', exclude_none=True)
  2. DandiBaseModel.unvalidated

    1. Its use can be replaced with pydantic.BaseModel.model_construct

Even if just for "practice", I think it would be good to provide a proper deprecation cycle for those, in case there happens to be a tool using dandischema directly and not relying on any "pydantic specifics": please keep them in the API, but make them raise DeprecationWarning, and let's schedule to drop them entirely in e.g. 4-6 months after the entire migration to Pydantic 2.0 "settles down".

@satra I think @yarikoptic has a valid point. Maybe we should keep those APIs in place for a while instead of removing them now, as you said in Monday's meeting. I can keep those APIs in place by wrapping the new features provided by Pydantic 2.0.

satra commented 7 months ago

@candleindark - i'm fine with a deprecation cycle. however, if nobody is using them, i'm also ok with the decision to remove. we will be doing a significant change anyway with pydantic 2. if there are significant changes made to other pieces (cli, server), then let's remove those functions. if not, we can put them into deprecated mode.

jwodder commented 7 months ago

@candleindark The tests are currently failing. Aside from the "Test against dandi-cli" ones (which will fail until dandi-cli is updated for Pydantic 2), you need to fix those before this can be merged.

candleindark commented 7 months ago

@candleindark The tests are currently failing. Aside from the "Test against dandi-cli" ones (which will fail until dandi-cli is updated for Pydantic 2), you need to fix those before this can be merged.

I will continue to work on getting those failing tests to pass. The reason this is up for review is so that you can give input and help expedite this.

Thanks for all the help.

candleindark commented 6 months ago

@jwodder @satra Can we consider removing the logic of the following lines from the JSON schema customization of the entire model and consolidating it into a type?

https://github.com/dandi/dandi-schema/blob/36753193205198b7855aed842af76d5e5829816f/dandischema/models.py#L448-L451

It seems to me that, at this point, the logic of these lines only affects the JSON schemas for Organization.identifier and Affiliation.identifier, and the requirements for those two fields are identical.

In Pydantic V2, an annotated type of str, RORID, can be defined in the following manner to have the desired validation/serialization behavior (Pydantic core schema) and JSON schema.

from typing import Type
import json

from typing_extensions import Annotated

from pydantic import BaseModel, GetCoreSchemaHandler, GetJsonSchemaHandler
from pydantic_core import CoreSchema, core_schema
from pydantic.json_schema import JsonSchemaValue

class _RORIDAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source: Type[str], handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        assert source is str

        # Build and return the core schema directly
        return core_schema.str_schema(
            pattern=r"^https://ror.org/[a-z0-9]+$",
            max_length=1_000,
        )

    @classmethod
    def __get_pydantic_json_schema__(
        cls, schema: CoreSchema, handler: GetJsonSchemaHandler
    ) -> JsonSchemaValue:
        json_schema = handler(schema)

        json_schema["title"] = "A ror.org identifier"
        json_schema["description"] = "Use an ror.org identifier for institutions."
        json_schema["nskey"] = "schema"
        json_schema["format"] = "uri"

        return json_schema

RORID = Annotated[str, _RORIDAnnotation()]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID

class Model2(BaseModel):
    identifier: RORID

j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model1",
  "type": "object"
}
"""

j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model2",
  "type": "object"
}
"""

RORID can be used to annotate Organization.identifier, Affiliation.identifier, and any other place that requires a RORID, and you will always get the expected, consistent Pydantic core and JSON schemas. Additionally, with this approach, Organization.identifier and Affiliation.identifier will actually invalidate any string longer than 1000 characters (something the current method doesn't do).

I am aware that the current method automatically marks any property that matches the URI pattern, but the above method allows reusing a type that completely binds the validation/serialization behavior (Pydantic core schema) and the JSON schema within it, which allows higher predictability.

candleindark commented 6 months ago

It turns out that the RORID type from the last comment can be defined with less boilerplate.

import json

from pydantic import BaseModel, Field

from typing_extensions import Annotated

RORID = Annotated[
    str,
    Field(
        title="A ror.org identifier",
        description="Use an ror.org identifier for institutions.",
        pattern=r"^https://ror.org/[a-z0-9]+$",
        max_length=1_000,
        json_schema_extra={"nskey": "schema", "format": "uri"},
    ),
]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID

class Model2(BaseModel):
    identifier: RORID

j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model1",
  "type": "object"
}
"""

j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model2",
  "type": "object"
}
"""

or

import json

from pydantic import BaseModel, Field, StringConstraints

from typing_extensions import Annotated

RORID = Annotated[
    str,
    StringConstraints(pattern=r"^https://ror.org/[a-z0-9]+$", max_length=1_000),
    Field(
        title="A ror.org identifier",
        description="Use an ror.org identifier for institutions.",
        json_schema_extra={"nskey": "schema", "format": "uri"},
    ),
]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID

class Model2(BaseModel):
    identifier: RORID

j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model1",
  "type": "object"
}
"""

j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
  "properties": {
    "identifier": {
      "description": "Use an ror.org identifier for institutions.",
      "format": "uri",
      "maxLength": 1000,
      "nskey": "schema",
      "pattern": "^https://ror.org/[a-z0-9]+$",
      "title": "A ror.org identifier",
      "type": "string"
    }
  },
  "required": [
    "identifier"
  ],
  "title": "Model2",
  "type": "object"
}
"""
codecov[bot] commented 6 months ago

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (fcd0ba8) 97.72% compared to head (2d48a09) 91.99%.

Files Patch % Lines
dandischema/models.py 93.71% 11 Missing :warning:
dandischema/datacite.py 0.00% 1 Missing :warning:
dandischema/utils.py 93.33% 1 Missing :warning:
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master     #203      +/-   ##
    ==========================================
    - Coverage   97.72%   91.99%   -5.73%
    ==========================================
      Files          17       18       +1
      Lines        1755     1799      +44
    ==========================================
    - Hits         1715     1655      -60
    - Misses         40      144     +104

    Flag: unittests — Coverage Δ: 91.99% <95.42%> (-5.73%)

    Flags with carried forward coverage won't be shown.


candleindark commented 6 months ago

@jwodder @satra Can one of you tell me what the intent of the following lines is, or whether they are still needed at all at this point?

https://github.com/dandi/dandi-schema/blob/36753193205198b7855aed842af76d5e5829816f/dandischema/tests/test_models.py#L506-L509

These lines are not executed by the test, and they have never been reachable in the test from the very beginning.

With the adjustments put in to support Pydantic V2, mypy is complaining that these lines are unreachable. Should I just remove them?

satra commented 6 months ago

@candleindark - if they are not reached, sure. i believe the change may have happened when the enumeration logic was changed.

satra commented 6 months ago

@jwodder and @AlmightyYakob - if we merge this, cli and archive will be out of sync. i believe we do want to do a full synced update. my plan is to merge this into master without a release label unless we can make the relevant PRs in cli and archive to go alongside this release?

jwodder commented 6 months ago

@satra

yarikoptic commented 6 months ago

I brought minor label back since unless we want to go major version boost, it should be here indeed.

FWIW, both archive and client constrain dandischema to < 0.9, so we would be good to release dandischema if so desired to facilitate migration:

❯ grep dandischema dandi-{archive,cli}/setup.*
dandi-archive/setup.py:        'dandischema~=0.8.4',
dandi-cli/setup.cfg:    dandischema ~= 0.8.0

so I would say we release, and then strive to upgrade -cli and -archive and boost their versioned dependency on dandischema.

Even our dandi client conda feedstock is bound

/home/yoh/proj/dandi/dandi-feedstock
❯ grep dandischema recipe/meta.yaml
    - dandischema >=0.8.0,<0.9.0

so we could even upgrade the conda package for dandischema, I guess
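The `~=` ("compatible release") pins quoted above mean "at least the given version, but below the next minor", which is why a 0.7.x schema release would not be picked up automatically. This can be checked with the packaging library (a small illustration, assuming `packaging` is installed):

```python
from packaging.specifiers import SpecifierSet

# dandi-archive's pin: 'dandischema~=0.8.4' is equivalent to >=0.8.4, ==0.8.*
spec = SpecifierSet("~=0.8.4")

assert "0.8.5" in spec       # a patch release satisfies the pin
assert "0.9.0" not in spec   # the next minor does not
```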

satra commented 6 months ago

ok let's release, but there may be others out there without such restrictions.

jwodder commented 6 months ago

Now that there's a "release" label on this PR, the "Test that old schemata are not modified" test is failing. I forget what the correct way to address this is.

satra commented 6 months ago

"Test that old schemata are not modified" test is failing.

to address it, the schema version needs to be updated. however, this PR should not have changed the schema. since we did not want to change the schema, it would be good to figure out what the deltas are and see where they are coming from.

yarikoptic commented 6 months ago

from the printed git diff it is hard to assess what the difference is -- it may be just reordering (I think that doesn't matter), but most likely content changes. @candleindark can you quickly check at the level of the data structures (load the JSONs and check the difference between them, e.g. at a high level whether keys or values differ, and how) and post a summary of what you find?
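A minimal way to do such a data-structure-level comparison might look like this (a hypothetical helper, not part of dandischema):

```python
import json
from typing import Any, Dict, List


def summarize_diff(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """Report which top-level keys were removed, added, or changed in value."""
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


res = summarize_diff({"a": 1, "b": 2, "c": 3}, {"b": 2, "c": 4, "d": 5})
assert res == {"only_in_old": ["a"], "only_in_new": ["d"], "changed": ["c"]}

# For on-disk schema files, load each side with json.load() first, e.g.:
# with open("old/context.json") as f_old, open("new/context.json") as f_new:
#     print(summarize_diff(json.load(f_old), json.load(f_new)))
```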

jwodder commented 6 months ago

Side note: If you're concerned about breakage in third-party code that uses dandischema, you can see a list of such code under Insights > Dependency Graph > Dependents. This won't catch all projects that use dandi-schema, and it may include some false positives, but it's a bigger list than you get by just checking PyPI.

candleindark commented 6 months ago

from the printed git diff it is hard to assess what the difference is -- it may be just reordering (I think that doesn't matter), but most likely content changes. @candleindark can you quickly check at the level of the data structures (load the JSONs and check the difference between them, e.g. at a high level whether keys or values differ, and how) and post a summary of what you find?

There are many changes in the schema files. I will go over them as I locate them, starting in this post with the changes in context.json.

context.json has the fewest changes, only two, and both result from a change in our logic.

The two changes are:

    "genotype": {
      "@id": "dandi:genotype",
      "@container": "@set"
    },

to

    "genotype": {
      "@id": "dandi:genotype"
    },

and

    "value": {
      "@id": "schema:value",
      "@container": "@set"
    },

to

    "value": {
      "@id": "schema:value"
    },

These changes result from the code change from https://github.com/dandi/dandi-schema/blob/a9d7f89e1ab739a6d7467a81c56e80e018d94cde/dandischema/metadata.py#L77-L78 to https://github.com/dandi/dandi-schema/blob/2fcc005c2cc76c31bf4d9fa1af0f006b8a56fccb/dandischema/metadata.py#L103-L104. This code change is based on @jwodder's suggestion, which I think is more correct, since Participant.genotype is of type Optional[Union[List[GenotypeInfo], Identifier]] and PropertyValue.value is of type Union[Any, List[Any]]. They are not always lists, and it may not be appropriate to always mark them as containers.

However, please let me know if you want to keep the old behavior which can be implemented using the old solution of stringifying.

Changes in other schema files are most likely due to the changes in how Pydantic generates JSON schema in V2. I have already reverted the change regarding JSON schema for Optional fields in this PR, but there are other changes as well, and I will go over them when I have more details in subsequent posts. For a general picture of what has changed in JSON schema generation in Pydantic, please visit https://docs.pydantic.dev/latest/migration/#changes-to-json-schema-generation.

satra commented 6 months ago

@candleindark - for the context generator, could we keep it as before for now? json-ld doesn't have the same notion as pydantic or json schema about multiple value types. let's file an issue to revisit the json-ld context so we don't forget about it.

yarikoptic commented 6 months ago

This code change is based on @jwodder's suggestion, which I think is more correct, since Participant.genotype is of type Optional[Union[List[GenotypeInfo], Identifier]] and PropertyValue.value is of type Union[Any, List[Any]]. They are not always lists, and it may not be appropriate to always mark them as containers. However, please let me know if you want to keep the old behavior, which can be implemented using the old solution of stringifying.

Let's keep old behavior but annotate (and create an issue pointing to it) for what we could plan to "break" for 0.7.0 release of schema -- I created a milestone https://github.com/dandi/dandi-schema/milestone/3 for those so we do not forget any of them to be considered

candleindark commented 6 months ago

@satra @yarikoptic The behavior of the context generator has been restored. https://github.com/dandi/dandi-schema/blob/0c97b8eac7a601fab2903eac7c0439ec021aa2f9/dandischema/metadata.py#L93-L97

Issue #214 has been filed for the consideration of an alternative solution.

candleindark commented 6 months ago

@yarikoptic @satra All the differences in the JSON schema files have been thoroughly inspected. All of these differences are summarized in the top post of this PR.

To see some of the differences yourself, a good tool to use is https://jsondiff.com/. (For the differences between the "definitions" object and its replacement, the "$defs" object, you need to compare them at the top level.)

jwodder commented 6 months ago

@candleindark So assuming that all remaining changes are non-functional and we're OK with the schema being changed accordingly, I believe the way to get the tests to stop failing is to increment DANDI_SCHEMA_VERSION in dandischema/consts.py to "0.6.5". In addition, "0.6.4" should presumably be added to ALLOWED_INPUT_SCHEMAS.

yarikoptic commented 6 months ago

assuming that all remaining changes are non-functional

well, that is the real question here. Given the extended list of differences (thank you @candleindark for preparing it!), even if they look non-functional, I am not sure we could claim them to be such! Clearly, looking at the output schema -- it has changed in incompatible ways (some fields were added/renamed/removed). The safest course of action might be to make it the "0.7.0" schema version. WDYT @dandi/dandiarchive? I added a note about the need to verify compatibility with jsonschema to the description of https://github.com/dandi/dandi-archive/issues/1791#issue-2047112455

candleindark commented 6 months ago

@jwodder @yarikoptic I am not clear about the course of action from this point on. Do I need to do anything more to this PR? Should I do what John recommends but use version "0.7.0" instead of "0.6.5"?

yarikoptic commented 5 months ago

current state: I boosted the schema version to 0.7.0, and we need to review whether any upgrades to the dict serialization are needed in the migrate function, where I added a placeholder in the code

satra commented 5 months ago

@candleindark - could you please address the linting error?

@yarikoptic - i'm assuming we are releasing without the CLI changes? shall we release once the linting error is fixed?

jwodder commented 5 months ago

@satra No, we can't release yet, as the TODO item in this commit is still undone.

satra commented 5 months ago

there haven't been any data model changes, have there? and i'm assuming the migrate tests all pass? the only time a migration code change is required is if we change the data model.

jwodder commented 5 months ago

I've converted this PR to a draft in order to prevent accidental merging until we can confirm that all necessary changes have been made.

candleindark commented 5 months ago

The test failed. It seems to be a github problem, but I don't have the option to re-run it.

candleindark commented 5 months ago

The test failed. It seems to be a github problem, but I don't have the option to re-run it.

Did someone just re-run the test for me? It just re-ran and succeeded.

yarikoptic commented 5 months ago

I resolved merge conflict - trivial conflict in imports