Support foreign keys property

ChasNelson1990 commented 1 year ago

Context:

We have multiple CKAN datasets that are composed of a set of resources.
In each of these datasets, one of these resources is a list of geographical regions.
All other resources in these datasets include a column for geographical region.
We want to use the foreignKeys prop (https://specs.frictionlessdata.io/table-schema/#foreign-keys) in frictionless to ensure that the choices in said column are present, and only present, in the geographical regions resource.

Good news:

This is supported upstream by frictionless :tada:

Less good news:

It requires treating two tables and two schemas as one Data Package

Idea No. 1:

One way to implement this is that the ckanext-validation could check schemas for foreignKeys and, if found, bundle both tables and both Table Schemas and send it all to frictionless's validate with type=package, e.g. we could have a _validate_data_package to match https://github.com/frictionlessdata/ckanext-validation/blob/9c23581d34289536139cc81f7b7ddb756dab64f6/ckanext/validation/jobs.py#L139

Idea No. 2:

An alternative is to have an entirely parallel set of package actions, e.g. https://github.com/frictionlessdata/ckanext-validation/blob/9c23581d34289536139cc81f7b7ddb756dab64f6/ckanext/validation/logic.py#L68 becomes package_validation_run and in these actions we treat the whole CKAN dataset (plus schemas) as a frictionless Data Package

We have a small amount of time to dedicate to this and would like to make a change that can be merged here (rather than maintaining our own fork) and so we'd really appreciate thoughts and ideas from existing maintainers and contributors.

All thoughts welcome :smile:

aivuk commented 1 year ago

Hi @ChasNelson1990, thanks for your idea and it makes sense to have this feature. We are currently working in a new update for the ckanext-validation that is going to be released still this week, then I would suggest you to first wait for the PR to be merged before starting working on it.

Regarding how to implement it, I still need to think a bit on the best approach but I'm prone to the idea no. 1. Let's see what @amercader has to say about it.

amercader commented 1 year ago

@ChasNelson1990 this certainly sounds interesting and I'm also leaning towards option 1, but I'm a bit unclear on how we would bundle both tables. Off the top of my head maybe something like this could work:

Looking at the foreignKeys spec, the outer resource is referenced by the name of an existing resource in the data package. In the context of CKAN I guess this could be the id of the resource that holds eg the geographic descriptor (or even a dataset_id:resource_id variant). Down the line there could be an UI to assign these in the schema editor but I wouldn't worry about it now.
Before running the validation job, we check if there are foreign keys and we get the metadata details of the resources needed (we need at least the url, is there any other property from the outer resource we need?)
We construct a virtual datapackage that include both resources (the original being validated and the referenced one(s)) and run the validation with type=package as you suggest

Things to consider:

What happens if the reference does not work (resource was deleted, wrong fields defined, etc)
Does validating FKs has an impact on the generated reports? can we still use the same current component?

@ChasNelson1990 let me know if this makes sense or I missed something

ChasNelson1990 commented 1 year ago

Thanks for the replies guys. I think everything makes sense and it feels like we are on the same page.

For some greater context @amercader here is what Fjelltopp do in our forks.

We have something like the following in our schema:

    "foreignKeys": [
        {
            "fields": "area_id",
            "reference": {
                "resource": "3_geojson_inputs",
                "fields": "area_id"
            }
        }
    ],

Where resource is the schema name of the other resource within the dataset.

If I am remembering correctly this may depend on some changes in our fork of ckanext-scheming allowing us to define a package schema with required resources and point to unique schemas for each resource, e.g. https://github.com/fjelltopp/unaids_data_specifications/blob/eb97a7d956936e3bfa6f3fe36d869249a8c4c525/package_schemas/country_estimates/3_country_estimates_23.json#L125. Those fork changes should not be necessary to enable foreign key validation here though.

We then check if the schema has foreignKeys and then get the appropriate data and pass it to (here) goodtables, e.g. https://github.com/fjelltopp/ckanext-validation/blob/8672ad5fa42405f0e8aaa40204cdec8676df13c9/ckanext/validation/jobs.py#L155. We know that the way we are doing this is very slow and we have never optimised this (and it still uses goodtables).

As you say, we could just provide a repeatable identifier that allows us to infer the resource and pass a 'virtual' data package to frictionless.

What happens if the reference does not work (resource was deleted, wrong fields defined, etc)

When we do a resource validation with frictionless it does a foreign key check and throws a few errors we could catch: https://github.com/frictionlessdata/framework/blob/ec2d8336dfa32f2396b38e4f33e3a7e9e4ff415a/frictionless/resource/resource.py#L718. This would cover a resource not being in the package, e.g. because it has been deleted? We would then want to return that as a "Warning" I would guess as that seems to be where we keep framework issues as opposed to validation issues?

And, while I am not sure I can find the line, I would presume it validates the schema as part of validating a resource? Because schema validation throws a known error if wrong fields are defined: https://github.com/frictionlessdata/framework/blob/ec2d8336dfa32f2396b38e4f33e3a7e9e4ff415a/frictionless/schema/schema.py#L350. Again this could be caught and returned as a warning.

Does validating FKs has an impact on the generated reports? can we still use the same current component?

This I have no idea about - is there an easy way to test / explore this?

tomeksabala commented 1 year ago

Hello @amercader and @aivuk . I'm working togheter with @ChasNelson1990 on the issue. I'd like to bump the issue a little bit and make sure we're on the same page about:

@amercader so would you agree that presented option 1 is a sensible solution? We're happy to create a draft PR with it for further review to continue a more detailed discussion. Adria, could you also confirm Chas is properly adressing your comments from earlier?
@aivuk What's the current timeline for finalizing https://github.com/frictionlessdata/ckanext-validation/pull/83 ? We'd like to branch our work from this PR and would like to avoid possible rebase conflicts.

amercader commented 1 year ago

@tomeksabala Yes, option 1 is the more sensible and @ChasNelson1990 comments make sense, although the part that is less clear to me is the "get the appropriate data and pass it to goodtables". IIUC the proposed approach, we would somehow create a virtual datapackage that links to all tables involved as resources and let frictionless deal with the data loading.

But as you say all this would be much easier to discuss with actual code so a draft PR would be great.

With regards to #83 we are actively working on it so it's hard to give a timeline but the goal is to finish the uploader work in the next couple of weeks, merge to master and then carry out further work (schema editor ec) as new feature branches. In any case most of the work there affects the UI and custom endpoints, we are not touching the jobs module which is where I expect most of your work to be done so you can work of current master if you want.

kforenc commented 1 year ago

Hi @amercader and @aivuk

I'm one of the Fjelltopp team that is working on this feature.

We've spent sometime thinking this through and we'd like to get your opinion before we write any code. The plan is as follows:

The user can specify the foreignKey reference as a schema in one of 3 ways: URL, JSON object or the filename of a schema from a known directory. For the latter, we would introduce 'validation schema files' residing in some directory (e.g. "ckanext.validation.validation_schemas_directory"). This also allows reusing of schemas and is something we used heavily in our old fork.
To find the referenced resource we look at all resources in parent dataset for one that has the referenced schema (but only as a filename or a URL) - exactly as @ChasNelson1990 described above. Finding none or more than 1 resource would return an error.
We substitute all filename / URL references within the original schema with actual content and modify foreignKey to point to the generated name of the found resource so frictionless gets all the schema inlined. We pack everything into a virtual Package, validate it in frictionless and return the report.

We'd be really grateful to know if it makes sense for you :).

ChasNelson1990 commented 1 year ago

Hi @amercader @aivuk - have you guys seen the PR I created for this - #86?

Would be great to get some feedback and thoughts.

amercader commented 1 year ago

Sorry @ChasNelson1990 , we have very limited bandwidth right now but I'll try to look at this probably next week

ChasNelson1990 commented 1 year ago

No worries @amercader - we appreciate bandwidth issues! Next week would be great :-)

ckan / ckanext-validation

Support foreign keys property #84