candleindark closed this pull request 5 months ago
This PR removes the following APIs.
DandiBaseModel.json_dict
- Its use can now be replaced by
pydantic.BaseModel.model_dump(mode='json', exclude_none=True)
DandiBaseModel.unvalidated
- Its use can be replaced with
pydantic.BaseModel.model_construct
Even if just for "practice", I think it might be good to provide a proper deprecation cycle for those in case there is a tool using dandischema directly and not relying on any "pydantic specifics": please keep those in the API, but make them raise DeprecationWarning,
and let's schedule to drop them entirely, e.g., 4-6 months after the entire migration to pydantic 2.0 "settles down".
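For illustration, a deprecation shim along those lines could look like the sketch below (the real DandiBaseModel has much more to it, and the exact warning messages are assumptions):

```python
import warnings
from typing import Any, Dict

from pydantic import BaseModel


class DandiBaseModel(BaseModel):
    """Sketch: keep the old helpers as thin deprecated wrappers."""

    def json_dict(self) -> Dict[str, Any]:
        warnings.warn(
            "DandiBaseModel.json_dict is deprecated; use "
            "model_dump(mode='json', exclude_none=True) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.model_dump(mode="json", exclude_none=True)

    @classmethod
    def unvalidated(cls, **kwargs: Any) -> "DandiBaseModel":
        warnings.warn(
            "DandiBaseModel.unvalidated is deprecated; use model_construct instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls.model_construct(**kwargs)
```

Callers keep working unchanged while getting a DeprecationWarning, and both methods can be deleted wholesale once the cycle ends.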
@satra I think @yarikoptic has a valid point. Maybe we should keep those APIs in place for a while instead of removing them now, as you said in Monday's meeting. I can keep those APIs in place by wrapping the new features provided by Pydantic 2.0.
@candleindark - i'm fine with a deprecation cycle. however, if nobody is using them, i'm also ok with the decision to remove. we will be doing a significant change anyway with pydantic 2. if there are significant changes made to other pieces (cli, server), then let's remove those functions. if not, we can put them into deprecated mode.
@candleindark The tests are currently failing. Aside from the "Test against dandi-cli" ones (which will fail until dandi-cli is updated for Pydantic 2), you need to fix those before this can be merged.
I will continue working on getting those failing tests to pass. The reason this is up for review is so that you can give input and help expedite this.
Thanks for all the help.
@jwodder @satra Can we consider removing the logic of the following lines from the JSON schema customization of the entire model and consolidating it into a type?
It seems to me that, at this point, the logic of these lines only affects the JSON schema for Organization.identifier and Affiliation.identifier, and the requirements for Organization.identifier and Affiliation.identifier are identical.
In Pydantic V2, an annotated type of str, RORID, can be defined in the following manner to have the desired validation/serialization behavior (Pydantic core schema) and JSON schema.
import json
from typing import Type

from typing_extensions import Annotated
from pydantic import BaseModel, GetCoreSchemaHandler, GetJsonSchemaHandler
from pydantic_core import CoreSchema, core_schema
from pydantic.json_schema import JsonSchemaValue


class _RORIDAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source: Type[str], handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        assert source is str
        return handler(
            core_schema.str_schema(
                pattern=r"^https://ror.org/[a-z0-9]+$",
                max_length=1_000,
            )
        )

    @classmethod
    def __get_pydantic_json_schema__(
        cls, kore_schema: CoreSchema, handler: GetJsonSchemaHandler
    ) -> JsonSchemaValue:
        json_schema = handler(kore_schema)
        json_schema["title"] = "A ror.org identifier"
        json_schema["description"] = "Use an ror.org identifier for institutions."
        json_schema["nskey"] = "schema"
        json_schema["format"] = "uri"
        return json_schema


RORID = Annotated[str, _RORIDAnnotation()]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID


class Model2(BaseModel):
    identifier: RORID
j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model1",
"type": "object"
}
"""
j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model2",
"type": "object"
}
"""
RORID can be used to annotate Organization.identifier, Affiliation.identifier, and any other place that requires a RORID, and you will always get the consistent, expected Pydantic core and JSON schemas. Additionally, doing so means Organization.identifier and Affiliation.identifier will actually invalidate any string greater than 1000 characters in length (something the current method doesn't do).
I am aware that the current method allows marking any property that possesses the URI pattern automatically, but the above method allows reusing a type that completely binds the validation/serialization behavior (Pydantic core schema) and the JSON schema together, which allows higher predictability.
It turns out that the RORID type from the last comment can be defined with less boilerplate.
import json
from pydantic import BaseModel, Field
from typing_extensions import Annotated
RORID = Annotated[
    str,
    Field(
        title="A ror.org identifier",
        description="Use an ror.org identifier for institutions.",
        pattern=r"^https://ror.org/[a-z0-9]+$",
        max_length=1_000,
        json_schema_extra={"nskey": "schema", "format": "uri"},
    ),
]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID


class Model2(BaseModel):
    identifier: RORID
j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model1",
"type": "object"
}
"""
j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model2",
"type": "object"
}
"""
or
import json
from pydantic import BaseModel, Field, StringConstraints
from typing_extensions import Annotated
RORID = Annotated[
    str,
    StringConstraints(pattern=r"^https://ror.org/[a-z0-9]+$", max_length=1_000),
    Field(
        title="A ror.org identifier",
        description="Use an ror.org identifier for institutions.",
        json_schema_extra={"nskey": "schema", "format": "uri"},
    ),
]

# RORID can now be reused in different fields across different models
# It will always have the same Pydantic core and JSON schemas
class Model1(BaseModel):
    identifier: RORID


class Model2(BaseModel):
    identifier: RORID
j_schema1 = json.dumps(Model1.model_json_schema(), indent=2)
print(j_schema1)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model1",
"type": "object"
}
"""
j_schema2 = json.dumps(Model2.model_json_schema(), indent=2)
print(j_schema2)
"""
{
"properties": {
"identifier": {
"description": "Use an ror.org identifier for institutions.",
"format": "uri",
"maxLength": 1000,
"nskey": "schema",
"pattern": "^https://ror.org/[a-z0-9]+$",
"title": "A ror.org identifier",
"type": "string"
}
},
"required": [
"identifier"
],
"title": "Model2",
"type": "object"
}
"""
Attention: 13 lines in your changes are missing coverage. Please review.
Comparison is base (fcd0ba8) 97.72% compared to head (2d48a09) 91.99%.
Files | Patch % | Lines |
---|---|---|
dandischema/models.py | 93.71% | 11 Missing :warning: |
dandischema/datacite.py | 0.00% | 1 Missing :warning: |
dandischema/utils.py | 93.33% | 1 Missing :warning: |
@jwodder @satra Can one of you tell me what the intent of the following lines is, or whether they are still needed at all at this point?
These lines are not executed by the tests, and they have not been reachable in the tests from the very beginning.
With the adjustments put in for the support of Pydantic V2, mypy is complaining that these lines are unreachable. Should I just remove them?
@candleindark - if they are not reached, sure. i believe the change may have happened when the enumeration logic was changed.
@jwodder and @AlmightyYakob - if we merge this, cli and archive will be out of sync. i believe we do want to do a full synced update. my plan is to merge this into master without a release label unless we can make the relevant PRs in cli and archive to go alongside this release?
@satra I brought the minor label back since, unless we want to go for a major version boost, it should indeed be there.
FWIW, both archive and client limit to <= 0.9, so we would be good to release dandischema if so desired to facilitate migration:
❯ grep dandischema dandi-{archive,cli}/setup.*
dandi-archive/setup.py: 'dandischema~=0.8.4',
dandi-cli/setup.cfg: dandischema ~= 0.8.0
so I would say we release, and then strive to upgrade -cli and -archive and boost their versioned dependency on dandischema.
Even our dandi client conda feedstock is bound
/home/yoh/proj/dandi/dandi-feedstock
❯ grep dandischema recipe/meta.yaml
- dandischema >=0.8.0,<0.9.0
so we could even upgrade the conda package for dandischema, I guess
ok let's release, but there may be others out there without such restrictions.
Now that there's a "release" label on this PR, the "Test that old schemata are not modified" test is failing. I forget what the correct way to address this is.
"Test that old schemata are not modified" test is failing.
to address it, the schema version needs to be updated. however, this PR should not have changed the schema. since we did not want to change the schema, it would be good to figure out what the deltas are and see where they are coming from.
from the printed git diff
it is hard to assess what the difference is -- it may be just reordering (I think that doesn't matter) but most likely there are content changes. @candleindark can you quickly check at the level of the data structure (load the jsons and check the difference between them, e.g. on a high level whether keys or values differ, and how) and post a summary of what you find?
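A minimal way to do that comparison at the data-structure level could be the sketch below (file paths in the usage comment are placeholders; lists are compared wholesale for simplicity):

```python
import json


def summarize_diff(old, new, path=""):
    """Recursively report added/removed keys and changed values
    between two parsed JSON structures (lists are compared wholesale)."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            p = f"{path}/{key}"
            if key not in old:
                print(f"added:   {p}")
            elif key not in new:
                print(f"removed: {p}")
            else:
                summarize_diff(old[key], new[key], p)
    elif old != new:
        print(f"changed: {path}: {old!r} -> {new!r}")


# usage (placeholder paths):
# with open("old/context.json") as f1, open("new/context.json") as f2:
#     summarize_diff(json.load(f1), json.load(f2))
```

Unlike a textual git diff, this ignores key ordering entirely, so only genuine structural or value deltas get reported.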
Side note: If you're concerned about breakage in third-party code that uses dandischema, you can see a list of such code under Insights > Dependency Graph > Dependents. This won't catch all projects that use dandi-schema, and it may include some false positives, but it's a bigger list than you get by just checking PyPI.
There are many changes in the schema files. I will go over them as I locate them, and I will go over the changes in context.json first in this post.
context.json has the least number of changes, only two. These result from a change in our logic.
The two changes are:
"genotype": {
"@id": "dandi:genotype",
"@container": "@set"
},
to
"genotype": {
"@id": "dandi:genotype"
},
"value": {
"@id": "schema:value",
"@container": "@set"
},
to
"value": {
"@id": "schema:value"
},
These changes result from the code change from https://github.com/dandi/dandi-schema/blob/a9d7f89e1ab739a6d7467a81c56e80e018d94cde/dandischema/metadata.py#L77-L78 to https://github.com/dandi/dandi-schema/blob/2fcc005c2cc76c31bf4d9fa1af0f006b8a56fccb/dandischema/metadata.py#L103-L104. This code change is based on @jwodder's suggestion, which I think is more correct, for Participant.genotype is of the type Optional[Union[List[GenotypeInfo], Identifier]] and PropertyValue.value is of the type Union[Any, List[Any]]. They are not always a list, and it may not be appropriate to always mark them as a container.
However, please let me know if you want to keep the old behavior, which can be implemented using the old solution of stringifying.
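For context, a type-inspection check along the lines of the new logic can be sketched as below (this is not the actual helper in dandischema/metadata.py, whose details may differ; it only illustrates deciding "is this annotation ever a list?" from the typing metadata rather than from a stringified type):

```python
from typing import Any, List, Optional, Union, get_args, get_origin


def is_list_like(annotation: Any) -> bool:
    """Sketch: True if the annotation is a list or a union containing a list,
    which is the kind of test that decides whether to emit
    '"@container": "@set"' in the JSON-LD context."""
    origin = get_origin(annotation)
    if origin is list:
        return True
    if origin is Union:
        # Optional[X] is Union[X, None], so this also walks Optional types
        return any(is_list_like(arg) for arg in get_args(annotation))
    return False
```

For Participant.genotype and PropertyValue.value this returns True only because one union branch is a list, which is exactly why marking them as an always-container is debatable.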
Changes in other schema files are most likely due to the changes in how Pydantic generates JSON schema in V2. I have already reverted the change regarding JSON schema for Optional fields in this PR, but there are other changes as well, and I will go over them when I have more details in subsequent posts. For a general picture of what has changed in JSON schema generation in Pydantic, please visit https://docs.pydantic.dev/latest/migration/#changes-to-json-schema-generation.
@candleindark - for the context generator, could we keep it as before for now? json-ld doesn't have the same notion as pydantic or json schema about multiple value types. let's file an issue to revisit the json-ld context so we don't forget about it.
Let's keep the old behavior but annotate it (and create an issue pointing to it) as something we could plan to "break" for the 0.7.0 release of the schema -- I created a milestone https://github.com/dandi/dandi-schema/milestone/3 for those so we do not forget any of them.
@satra @yarikoptic The behavior of the context generator has been restored. https://github.com/dandi/dandi-schema/blob/0c97b8eac7a601fab2903eac7c0439ec021aa2f9/dandischema/metadata.py#L93-L97
Issue #214 has been filed for the consideration of an alternative solution.
@yarikoptic @satra All the differences in the JSON schema files have been thoroughly inspected. All of these differences are summarized in the top post in this issue.
To see some of the differences yourself, a good tool to use is https://jsondiff.com/. (For the differences between the "definitions" object and its replacement, the "$defs" object, you need to compare them at the top level.)
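For the "definitions" vs "$defs" rename specifically, one way to compare like-for-like is to normalize one spelling to the other before diffing. A sketch (the rename-by-string-replace is a shortcut that assumes neither key name appears inside schema values):

```python
import json


def with_definitions(schema: dict) -> dict:
    """Return a copy of a Pydantic V2 JSON schema with the "$defs"
    container and "#/$defs/..." refs renamed to the V1-era
    "definitions" form, so the two versions diff cleanly."""
    text = json.dumps(schema)
    text = text.replace('"#/$defs/', '"#/definitions/')
    text = text.replace('"$defs"', '"definitions"')
    return json.loads(text)
```

After normalizing, a plain structural diff of the two schemas shows only the substantive changes rather than the wholesale container rename.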
@candleindark So assuming that all remaining changes are non-functional and we're OK with the schema being changed accordingly, I believe the way to get the tests to stop failing is to increment DANDI_SCHEMA_VERSION in dandischema/consts.py to "0.6.5". In addition, "0.6.4" should presumably be added to ALLOWED_INPUT_SCHEMAS.
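Concretely, the suggested edit would look something like the sketch below (the surrounding contents of dandischema/consts.py are not reproduced here, and the earlier version strings are placeholders):

```python
# Sketch of the suggested edit to dandischema/consts.py.
# Earlier version strings below are placeholders, not the file's actual list.
DANDI_SCHEMA_VERSION = "0.6.5"  # bumped from "0.6.4"

ALLOWED_INPUT_SCHEMAS = [
    # ...earlier versions already listed in the file...
    "0.6.4",  # added so metadata produced under the previous schema can still be migrated
]
```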
assuming that all remaining changes are non-functional
well, that is really the question here. Given the extended list of differences (thank you @candleindark for preparing it!), even if they look non-functional, I am not sure we could claim them to be such! Clearly, looking at the output schema -- it has changed in incompatible ways (some fields are added/renamed/removed). The safest course of action might be to make it into the "0.7.0" schema version. WDYT @dandi/dandiarchive? I added a note on the need to verify compatibility with jsonschema into the description of https://github.com/dandi/dandi-archive/issues/1791#issue-2047112455
@jwodder @yarikoptic I am not clear about the course of action from this point. Do I need to do anything more to this PR? Should I do what John recommends but use version "0.7.0" instead of "0.6.5"?
current state: I boosted the schema version to 0.7.0, and we need to review whether any upgrades to the dict serialization are needed in the migrate function, where I added a placeholder in the code
@candleindark - could you please address the linting error?
@yarikoptic - i'm assuming we are releasing without the CLI changes? shall we release once the linting error is fixed?
@satra No, we can't release yet, as the TODO item in this commit is still undone.
there haven't been any data model changes, have there? and i'm assuming the migrate tests all pass? the only time a migration code change is required is if we change the data model.
I've converted this PR to a draft in order to prevent accidental merging until we can confirm and are sure that all necessary changes have been made.
The test failed. It seems to be a github problem, but I don't have the option to re-run it.
Did someone just re-run the test for me? It just re-ran and succeeded.
I resolved a merge conflict - a trivial conflict in imports
This PR migrates Pydantic from V1 to V2. It closes #176.

This PR deprecates the following APIs:

dandischema.models.DandiBaseModel.json_dict
- Its use can be replaced by pydantic.BaseModel.model_dump(mode='json', exclude_none=True)

dandischema.models.DandiBaseModel.unvalidated
- Its use can be replaced with pydantic.BaseModel.model_construct

Additional notes regarding changes entailed by Pydantic V2:

- To use dandischema.models.CommonModel.repository in APIs which expect a str, use str() to convert the attribute to a str, for the URL types in pydantic.networks no longer inherit from str.
- The JSON schemas for Optional fields now indicate that the value null is allowed. Because this change in behavior is incompatible with required tests for dandi-schema, dandischema.utils.TransitionalGenerateJsonSchema has been defined to revert the behavior of the JSON schema generation for Optional fields back to the behavior exhibited by Pydantic V1. To generate a JSON schema that does not indicate the allowance of the null value, as for Pydantic V1, initiate the generation with Model.model_json_schema(schema_generator=TransitionalGenerateJsonSchema). For more information on customizing generation with a GenerateJsonSchema class, please check out these examples.
- CommonModel now has "description" instead of "descriptions" due to a typo correction I put in. (This is not a change entailed by Pydantic V2 but a correction I put in the code.)
- The JSON schema for fields of the type AnyHttpUrl has changed. This is possibly due to the fact that all URL types in Pydantic V2 no longer inherit from str, and there is no max length restriction.
- For a field of a Literal type of a single value, the "type" key is no longer specified in the corresponding JSON schema. For example, the JSON schema for a field of the type Literal["Asset"] no longer contains the "type" key with the value of "string". This change affects fields such as BareAsset.schemaKey.
- For an Enum subclass that has str values for the corresponding choices, the new JSON schema will include a "type" key with the value of "string". This change affects pretty much all the Enum subclasses defined in this project.
- A model without a docstring of its own no longer carries a "description" key in the JSON schema. For example, the JSON schema for Project no longer has the "description" key because only its parent, Activity, has a docstring, """Information about the Project activity""", while Project itself does not.

Note: This PR contains the "minimum" changes needed to support Pydantic V2. Other improvements to the project enabled by Pydantic V2 will come in subsequent PRs.
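For illustration, a custom generator along the lines described above can be sketched as follows (this is not the actual TransitionalGenerateJsonSchema implementation; the class and model names here are made up):

```python
from typing import Optional

from pydantic import BaseModel
from pydantic.json_schema import GenerateJsonSchema, JsonSchemaValue
from pydantic_core import core_schema


class NoNullGenerateJsonSchema(GenerateJsonSchema):
    """Sketch: emit the inner type's schema for Optional[X] fields,
    dropping the "null" branch, as Pydantic V1 used to do."""

    def nullable_schema(self, schema: core_schema.NullableSchema) -> JsonSchemaValue:
        # Generate the schema for the wrapped (non-null) type only
        return self.generate_inner(schema["schema"])


class Example(BaseModel):
    name: Optional[str] = None


# Optional[str] renders as a plain string schema, with no anyOf/null branch
print(Example.model_json_schema(schema_generator=NoNullGenerateJsonSchema))
```

Overriding nullable_schema is the narrowest hook for this: all other aspects of V2 schema generation (e.g. "$defs", enum handling) are left untouched.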
TODOs: