huard opened this issue 1 month ago
What we have:

- Loader, which returns attributes.
- schema, describing a subset of the attributes returned by the Loader. This schema might preclude additional properties, so it cannot be applied wholesale to the Loader's output.
- data model, describing the content we want included in the catalog. It includes a subset of the schema properties, as well as additional attributes desired by the catalog admins.

Desiderata:
> I think this calls for a higher level abstraction than what we have now.
I agree.
> Neither of these two schemas actually checks that the fields comply with the CV; they just make sure a value is present. So we use a pydantic.BaseModel to check that metadata attributes conform to the Controlled Vocabulary (CV).
I believe this is something that should be handled in the extensions. The extensions should define enums, as appropriate, for the well-known CV properties, so that the same validation happens through the JSON schemas. Those schemas could be generated from pydantic or other definitions, but they should ideally validate as much as the custom Python validations do.
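For instance, a minimal sketch assuming pydantic v2 (the model and CV values are made up for illustration): the enum encodes the CV on the model, and the JSON schema generated from the model carries the same enum constraint, so a plain jsonschema validator enforces the CV the same way pydantic does.

```python
import json
from enum import Enum

from pydantic import BaseModel


class ActivityID(str, Enum):
    # illustrative subset of the controlled vocabulary
    CMIP = "CMIP"
    ScenarioMIP = "ScenarioMIP"


class CMIP6Properties(BaseModel):
    activity_id: ActivityID
    source_id: str


# The generated schema includes the enum constraint for activity_id (under $defs),
# so validating with jsonschema rejects values outside the CV.
print(json.dumps(CMIP6Properties.model_json_schema(), indent=2))
```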
> My experience so far is that it's much easier to debug errors with Pydantic traceback than with the pystac.validation traceback.
I added this a while back: https://github.com/stac-utils/pystac/pull/1320
Maybe something to consider for injecting more verbose/efficient validation error reporting? Otherwise, maybe directly improve the error reporting produced by pystac natively...
Can you provide an example of a problematic case where the errors were easier to analyze with pydantic than with pystac? Maybe the adjustment can be applied based on that.
I have two issues with pystac validation:
I think it should be possible to address both points fairly easily. Thanks for detailing.
I have faced (2) as well. Iterating over the nested errors and printing them one by one, instead of "only" reporting the generic top-level error, can solve this; it creates error steps similar to a traceback. For (1), I think they could simply be omitted, since iterating over the errors would print the nested JSON fields along the way, giving enough context to interpret the full JSON.
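Something along these lines, using the jsonschema package directly (a rough sketch; `schema` and `instance` stand for whatever is being validated):

```python
from jsonschema import Draft202012Validator


def report_errors(schema: dict, instance: dict) -> None:
    """Print each validation error with its JSON path, descending into nested errors."""
    errors = list(Draft202012Validator(schema).iter_errors(instance))
    while errors:
        err = errors.pop(0)
        path = "$" + "".join(f".{part}" for part in err.absolute_path)
        print(f"{path}: {err.message}")
        # nested sub-errors (e.g. from anyOf/oneOf) give traceback-like steps
        errors.extend(err.context or [])
```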
I'm working on a new CMIP6-CORDEX extension and implementation for the catalog, so we can deploy it at Ouranos.
I'm hitting a few issues that I wanted to discuss.
Abstraction
Adding a new extension and implementation requires copy-pasting a fair amount of boilerplate code, where we only change class names and one attribute name (CMIP6 -> CORDEX6). I don't mind doing this once, but we'll want to add other data models in the future, and I can see the mess that multiple copies of the same logic would create. I think this calls for a higher level abstraction than what we have now.
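To make the idea concrete, here is a minimal sketch of the kind of abstraction I have in mind (all names are hypothetical, pydantic v2 assumed): a single generic helper parameterized by the data model and the property prefix, so new data models don't need their own near-identical extension classes.

```python
from typing import Generic, Type, TypeVar

import pystac
from pydantic import BaseModel

M = TypeVar("M", bound=BaseModel)


class DataModelExtension(Generic[M]):
    """Validate attributes with a pydantic model and write them onto a pystac.Item."""

    def __init__(self, model_cls: Type[M], prefix: str, schema_uri: str) -> None:
        self.model_cls = model_cls
        self.prefix = prefix
        self.schema_uri = schema_uri

    def apply(self, item: pystac.Item, attributes: dict) -> M:
        model = self.model_cls(**attributes)  # metadata / CV validation via pydantic
        for name, value in model.model_dump(exclude_none=True).items():
            item.properties[f"{self.prefix}:{name}"] = value
        if self.schema_uri not in item.stac_extensions:
            item.stac_extensions.append(self.schema_uri)
        return model


# CMIP6 and CORDEX6 then become one-liners instead of near-identical classes:
# cmip6 = DataModelExtension(CMIP6Attributes, "cmip6", CMIP6_SCHEMA_URI)
# cordex6 = DataModelExtension(CORDEX6Attributes, "cordex6", CORDEX6_SCHEMA_URI)
```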
Validation
The current CMIP6 extension does validation in two places: in the Pydantic base model, and in the pystac.Item through jsonschema. Deepak created the custom jsonschema for the CMIP data that we currently use. There is now an official CMIP6 jsonschema, but I don't think the UoT catalog entries would validate against it; it looks to be made for ESGF-held datasets, which apparently carry additional attributes.
Neither of these two schemas actually checks that the fields comply with the CV; they just make sure a value is present. So we use a pydantic.BaseModel to check that metadata attributes conform to the Controlled Vocabulary (CV).
Then we call item.validate(), which compares the STAC Item attributes against Deepak's jsonschema.
My experience so far is that it's much easier to debug errors with Pydantic traceback than with the pystac.validation traceback. However, if a jsonschema already exists, I don't want to reimplement that logic in pydantic. Also, our pydantic data model only checks the metadata attributes, not the STAC attributes.
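For reference, a compact sketch of the current two-step flow as I understand it (names and CV values are illustrative, pydantic v2 assumed): pydantic checks the metadata against the CV, then pystac checks the STAC structure against jsonschema.

```python
from datetime import datetime, timezone
from typing import Literal

import pystac
from pydantic import BaseModel


class CMIP6Attributes(BaseModel):
    activity_id: Literal["CMIP", "ScenarioMIP"]  # CV conformance, not just presence
    source_id: str


attrs = CMIP6Attributes(activity_id="CMIP", source_id="CanESM5")  # 1) pydantic / CV check

item = pystac.Item(
    id="example",
    geometry=None,
    bbox=None,
    datetime=datetime(2000, 1, 1, tzinfo=timezone.utc),
    properties={f"cmip6:{k}": v for k, v in attrs.model_dump().items()},
)
item.validate()  # 2) pystac: STAC core schema plus any declared extension schemas
```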
Customization
My experience is that catalog admins want some flexibility regarding the metadata to include in catalog entries. If a jsonschema exists, you might not want to include all of its properties in your catalog, and you may want to include additional properties not found in the schema. My understanding is that you cannot "unrequire" a property by subclassing or extending a jsonschema, meaning we'd have to maintain modified versions of the "official" schemas to achieve the required flexibility. So in the Abstraction discussed above, I think we need to account for the need to decouple the official external jsonschemas for metadata attributes from our STAC records.
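Because composing with allOf can only add constraints, the only option I see is to derive the catalog-specific schema programmatically from a copy of the official one, for example (illustrative sketch, the schema content is made up):

```python
import copy

official = {
    "type": "object",
    "required": ["activity_id", "further_info_url"],
    "properties": {
        "activity_id": {"type": "string"},
        "further_info_url": {"type": "string"},
    },
    "additionalProperties": False,
}


def derive_catalog_schema(schema: dict, drop_required: set, allow_extra: bool = True) -> dict:
    """Relax `required` and `additionalProperties` for catalog use."""
    derived = copy.deepcopy(schema)
    derived["required"] = [p for p in derived.get("required", []) if p not in drop_required]
    if allow_extra:
        derived["additionalProperties"] = True  # let admins add catalog-specific fields
    return derived


catalog_schema = derive_catalog_schema(official, drop_required={"further_info_url"})
```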