crim-ca / stac-populator

Workflow logic to populate STAC catalog with demo datasets.
MIT License

Abstractions to extend catalog to other data models #57

huard opened this issue 1 month ago

huard commented 1 month ago

I'm working on a new CMIP6-CORDEX extension and implementation for the catalog, so we can deploy it at Ouranos.

I'm hitting a few issues that I wanted to discuss.

Abstraction

Adding a new extension and implementation requires copy-pasting a fair amount of boilerplate code, changing only the class names and one attribute name (CMIP6 -> CORDEX6). I don't mind doing this once, but we'll want to add other data models in the future, and I can see the mess that multiple copies of the same logic would create. I think this calls for a higher-level abstraction than what we have now.
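As a rough sketch of what such an abstraction could look like, the shared population logic could live once in a base class, with each data model declaring only a prefix and a Pydantic model. All names here (`DataModelExtension`, `apply`, the CORDEX6 classes) are hypothetical, not the current stac-populator API:

```python
from pydantic import BaseModel
import pystac

# Hypothetical base class holding the logic every extension shares.
class DataModelExtension:
    prefix: str                    # e.g. "cmip6" -> "cmip6:activity_id"
    data_model: type[BaseModel]    # Pydantic model encoding the CV

    @classmethod
    def apply(cls, item: pystac.Item, attrs: dict) -> pystac.Item:
        model = cls.data_model(**attrs)  # CV validation happens here
        for name, value in model.model_dump().items():
            item.properties[f"{cls.prefix}:{name}"] = value
        return item

# A new data model then reduces to a prefix and a Pydantic model.
class Cordex6DataModel(BaseModel):
    activity_id: str  # the real model would constrain this against the CV

class CORDEX6Extension(DataModelExtension):
    prefix = "cordex6"
    data_model = Cordex6DataModel
```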

Validation

The current CMIP6 extension does validation in two places: in the Pydantic base model, and in the pystac.Item through jsonschema. Deepak created the custom jsonschema for CMIP data that we currently use. There is now an official CMIP6 jsonschema, but I don't think the UoT catalog would validate against it; it looks to be made for ESGF-held datasets, which apparently carry additional attributes.

Neither of these schemas actually checks that fields comply with the Controlled Vocabulary (CV); they only make sure a value is present. So we use a pydantic.BaseModel to check that metadata attributes conform to the CV.
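As a minimal sketch of what that Pydantic layer buys us, constraining fields to the vocabulary (here with `Literal`; the values are an illustrative subset, not the full CV):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Illustrative subset of the CV; the real models cover many more
# attributes and allowed values.
class CMIP6Attributes(BaseModel):
    activity_id: Literal["CMIP", "ScenarioMIP"]
    frequency: Literal["mon", "day", "6hr"]

CMIP6Attributes(activity_id="CMIP", frequency="mon")  # conforms to the CV
try:
    CMIP6Attributes(activity_id="not-a-mip", frequency="mon")
except ValidationError as err:
    print(err)  # names the offending field and lists the allowed values
```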

Then we call item.validate(), which compares the STAC Item attributes against Deepak's jsonschema.
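For context, this second layer is driven by the schema URIs listed on the item; a hedged sketch (the schema URL is a placeholder for wherever the custom schema is hosted, and `validate()` will actually try to fetch it):

```python
import datetime
import pystac

# pystac validates an item against the core Item schema plus every URI in
# stac_extensions, raising STACValidationError on failure. The URL below
# is a placeholder, not the real location of the custom CMIP6 schema.
item = pystac.Item(
    id="example",
    geometry=None,
    bbox=None,
    datetime=datetime.datetime(2020, 1, 1),
    properties={"cmip6:activity_id": "CMIP"},
)
item.stac_extensions.append("https://example.org/cmip6/item.json")
item.validate()  # fetches the schemas and runs jsonschema underneath
```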

My experience so far is that it's much easier to debug errors from the Pydantic traceback than from the pystac.validation traceback. However, if a jsonschema already exists, I don't want to duplicate that logic in Pydantic. Also, our Pydantic data model only checks the metadata attributes, not the STAC attributes.

Customization

My experience is that catalog admins want some flexibility in the metadata included in catalog entries. If a jsonschema exists, you might not want to include all of its properties in your catalog, and you may want to include additional properties not found in the schema. My understanding is that you cannot "unrequire" a property by subclassing or extending a jsonschema, meaning we'd have to maintain modified versions of the "official" schemas to achieve the required flexibility. So in the Abstraction discussed above, I think we need to account for the need to decouple official external jsonschemas for metadata attributes from our STAC records.
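To make the constraint concrete: JSON Schema composition (`allOf`, `$ref`) can only add requirements, never remove them, so relaxing an official schema means maintaining an edited copy. A hypothetical helper (`relax_schema` is made up for illustration):

```python
import copy

# Derive a local variant of an "official" schema with some required
# properties dropped. The official schema itself is untouched.
def relax_schema(schema: dict, drop: set[str]) -> dict:
    relaxed = copy.deepcopy(schema)
    relaxed["required"] = [p for p in relaxed.get("required", []) if p not in drop]
    return relaxed

official = {
    "type": "object",
    "required": ["activity_id", "institution_id"],
    "properties": {
        "activity_id": {"type": "string"},
        "institution_id": {"type": "string"},
    },
}
local = relax_schema(official, drop={"institution_id"})
assert local["required"] == ["activity_id"]
```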

huard commented 1 month ago

What we have:

Desiderata:

fmigneault commented 1 month ago

> I think this calls for a higher-level abstraction than what we have now.

I agree.

> Neither of these schemas actually checks that fields comply with the Controlled Vocabulary (CV); they only make sure a value is present. So we use a pydantic.BaseModel to check that metadata attributes conform to the CV.

I believe this is something that should be updated in the extensions. The extensions should define enums for the well-known CV properties, so that the same validation happens through the JSON schemas. Those schemas could be generated from pydantic or other definitions, but they should ideally validate as much as the custom Python validations do.
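As a sketch of that idea: a CV declared as an enum carries over into the JSON schema that pydantic v2 generates via `model_json_schema()`, so both validation paths enforce the same vocabulary (values illustrative):

```python
import enum
import json
from pydantic import BaseModel

# Illustrative CV subset declared once, as an enum.
class ActivityId(str, enum.Enum):
    CMIP = "CMIP"
    ScenarioMIP = "ScenarioMIP"

class CMIP6Properties(BaseModel):
    activity_id: ActivityId

# The generated schema contains {"enum": ["CMIP", "ScenarioMIP"]} under
# $defs/ActivityId, so a jsonschema validator enforces the same CV.
print(json.dumps(CMIP6Properties.model_json_schema(), indent=2))
```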

> My experience so far is that it's much easier to debug errors from the Pydantic traceback than from the pystac.validation traceback.

I added this a while back: https://github.com/stac-utils/pystac/pull/1320. Maybe something to consider for injecting more verbose/efficient validation error reporting? Otherwise, maybe directly improve the error reporting produced by pystac natively...

Can you provide an example of a problematic case where the errors were easier to analyze with pydantic than with pystac? Maybe the adjustment can be based on that.

huard commented 1 month ago

I have two issues with pystac validation:

  1. The error message includes the schema and the attributes. Since both are large, I need to scroll a long way back up to see the actual traceback.
  2. The error message only shows the best match, not all errors. This led me to believe there was a problem with one item, whereas the issue was that I was validating a dictionary against a dictionary of dictionaries. I think it would be better to print all error messages (a sketch below illustrates this).
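A minimal sketch of point 2, using the jsonschema library directly (the schema and instance are contrived): iterating over all errors rather than picking the single best match surfaces every problem at once.

```python
import jsonschema

# Contrived schema: the instance below violates both properties, but a
# best-match report would surface only one of the two errors.
schema = {
    "type": "object",
    "properties": {"a": {"type": "string"}, "b": {"type": "number"}},
    "required": ["a", "b"],
}
for error in jsonschema.Draft7Validator(schema).iter_errors({"a": 1}):
    print(f"{list(error.absolute_path)}: {error.message}")
```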
fmigneault commented 1 month ago

I think it should be possible to address both points fairly easily. Thanks for the details.

I have faced (2) as well. Iterating over the nested errors and printing them one by one, instead of "only" reporting the generic top-level error, can solve this; it creates error steps similar to a traceback. For (1), I think the schema and attributes could simply be omitted, since iterating over the errors would print the nested JSON fields along the way, giving enough context to interpret the full JSON.
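A hedged sketch of that approach with the jsonschema library's error tree (`report` is a made-up helper, not pystac code): errors produced by combinators like `anyOf` carry their sub-errors in `error.context`, which can be walked recursively.

```python
import jsonschema

# Recurse into error.context and print each nested error with its JSON
# path, instead of only the generic top-level message.
def report(error: jsonschema.exceptions.ValidationError, depth: int = 0) -> None:
    print("  " * depth + f"{list(error.absolute_path)}: {error.message}")
    for sub in error.context or []:
        report(sub, depth + 1)

schema = {"anyOf": [{"type": "string"}, {"type": "number"}]}
for error in jsonschema.Draft7Validator(schema).iter_errors([1, 2]):
    report(error)
```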