ckan / ckanext-validation

CKAN extension for validating Data Packages using Table Schema.
MIT License
28 stars 33 forks source link

Use frictionless framework v5 #69

Closed aivuk closed 1 year ago

aivuk commented 1 year ago

Update the extension to use Frictionless Framework version 5. Change the validate function from Goodtables to the Frictionless equivalent.

TODO:

amercader commented 1 year ago

@aivuk this PR should be against dev-v2 right?

aivuk commented 1 year ago

@aivuk this PR should be against dev-v2 right?

yes, sorry about that. I already changed it to dev-v2 now

codecov[bot] commented 1 year ago

Codecov Report

Merging #69 (fff8dc5) into dev-v2 (455090b) will decrease coverage by 0.16%. The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           dev-v2      #69      +/-   ##
==========================================
- Coverage   84.48%   84.31%   -0.17%     
==========================================
  Files          24       24              
  Lines        2062     2078      +16     
==========================================
+ Hits         1742     1752      +10     
- Misses        320      326       +6     
Impacted Files Coverage Δ
ckanext/validation/plugin/__init__.py 96.24% <ø> (ø)
ckanext/validation/tests/helpers.py 100.00% <ø> (ø)
ckanext/validation/jobs.py 82.85% <81.25%> (-0.87%) :arrow_down:
ckanext/validation/helpers.py 88.88% <87.50%> (-0.86%) :arrow_down:
ckanext/validation/logic.py 68.27% <100.00%> (-0.13%) :arrow_down:
ckanext/validation/tests/test_form.py 100.00% <100.00%> (ø)
ckanext/validation/tests/test_jobs.py 100.00% <100.00%> (ø)
ckanext/validation/tests/test_logic.py 100.00% <100.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

amercader commented 1 year ago

@aivuk The Aborted (core dumped) errors were not a Docker issue, but a max recursion error introduced by a change in CKAN core, tests are good now!

aivuk commented 1 year ago

@roll In the past we could pass to to the goodtables validation function an options json object where it was possible to define which tests to skip and pass parameters about the layout of the data, as number of rows, headers, etc https://github.com/frictionlessdata/ckanext-validation/#formats-to-validate In this PR I'm using directly the Frictionless validate function at https://github.com/frictionlessdata/ckanext-validation/blob/ae2087656712b3198fca66c45e2f45fc2a3c1cac/ckanext/validation/jobs.py#L154 and passing a "options" json object stored in the database by the user, but I couldn't find a way to pass the parameters to customize the validations checks as https://framework.frictionlessdata.io/docs/checks/table.html. What would you suggest?

@amercader

roll commented 1 year ago

Hi @aivuk,

In v5 the ad-hoc validation function got standardized:

dialect = Dialect() # according to Tabular Dialect
schema = Schema() # create schema according to Table Schmea spec
checklist = Checklist() # create checklist according to https://framework.frictionlessdata.io/docs/framework/checklist.html (see API reference)
resource = Resource(..., dialect=dialect, schema=schema, checklist=checklist) # create resource according to Data Resource spec
report = resource.validate()

You can consider storing the whole validation information as a resource descriptor in ckan database:

resource
  path
  scheme
  format
  ...
  dialect
  schema
  checklist

and then just use Resource(descritor).validate()

roll commented 1 year ago

So in the Framework now the whole information of data + schema + validation rules might be stored/shared as a resource descriptor. As one chunk of metadata. I think it's quite similar to greatexpections approach =)

roll commented 1 year ago

UPD. I missed Tabular Dialect in the first place. Now it's added as it's needed to manage header rows etc

roll commented 1 year ago

Also, as a part of Frictionless Application (components@2) I'm finishing visual components to create customizable editors for this metadata as:

So these metadata boundaries now is reflected on UI level too

roll commented 1 year ago

So when you need the UI part you just need to create a bootstrap modal with:

components2

which will edit the resource descriptor (including dialect/schema/etc).

aivuk commented 1 year ago

Also, as a part of Frictionless Application (components@2) I'm finishing visual components to create customizable editors for this metadata as:

* Resource Editor

* Dialect Editor

* Schema Editor

* Checklist Editor

So these metadata boundaries now is reflected on UI level too

That will be great because the next steps will be to improve the UI to customize the validation steps in CKAN Resource page. I asked myself with it would be possible to use just some of the components from the Frictionless Application, like we are doing with the Report component. Thanks!

roll commented 1 year ago

Yea that's the plan -- fully-reusable component for all metadata Frictionless has a package/resource/schema/report/etc that can be mixed in different configurations. We will maintain them as a part of the application codebase but release them as a separate/lightweight NPM package.

aivuk commented 1 year ago

@roll Is there an example of a Resource descriptor that includes the schema, checklist and dialect? Currently in ckanext-validation we have two fields saved in the database, the schema and the options, that I think that one it will not be necessary anymore. What you suggest is to just have a field where the user store all the resource descriptor information? What is the best way to create a descriptor, create the checklist separately, the layout, then the resource and save its descriptor? I want to create some descriptors with different parameters for the tests

roll commented 1 year ago

@aivuk Yes, I would switch to one resource descriptor to store in the db. Regarding editing it - there is a Python API for it so you basically can use any method you'd like. You can create schmea/etc first, you can add them on resource object, you can use resource.to_copy() etc

roll commented 1 year ago

Note that v5 has a clear boundary between descriptors and object model so the most explicit way of working with metadata will be:

resource = Resource.from_descriptor()
...
descriptor = resource.to_descriptor()

Also, take into account that it's now impossible to import a resource from an invalid descriptor as well as export an invalid object to a descriptor (if somehow you manage to make it invalid). So these calls are good to be wrapped in try/except

Or you can validate it first:

report = Resource.validate_descriptor(descriptor)
amercader commented 1 year ago

@roll @aivuk Note that the current version of ckanext-validation stores two separate fields:

I think that in terms of interoperability at the CKAN level it is important to keep the schema (which is based on a standard) separate from the validation options (which are specific to frictionless framework). The idea is that schema can be used by other integrations like the DataStore etc. so I'm keen on keeping it separate. We can definitely store the whole validation descriptor (schema + validation rules) in the validation_options and keep in sync the schema subset, or we can keep them separate and combine them when we need to call the validation, which might be probably easier in terms of keeping the current fields, validators etc (for the CKAN metadata, not for the data)

Another question, is there a way of translating goodtables validation optins to framework validation descriptors? I'm thinking in terms of migrating existing installations of ckanext-validation that are using the old options (I'm assuming there are no changes at the schema level as it's based on Table Schema)

aivuk commented 1 year ago

We can definitely store the whole validation descriptor (schema + validation rules) in the validation_options and keep in sync the schema subset, or we can keep them separate and combine them when we need to call the validation, which might be probably easier in terms of keeping the current fields, validators etc (for the CKAN metadata, not for the data)

I was thinking a bit about it and arrived at the same conclusion, I think the schema must be separate from the validation options, but I still haven't explored too much in how this split can be done and merged easily before running the validation. I will try generate some different descriptors using the framework and do some tests in how this can be done, but I think it will be straightforward.

Another question, is there a way of translating goodtables validation optins to framework validation descriptors? I'm thinking in terms of migrating existing installations of ckanext-validation that are using the old options (I'm assuming there are no changes at the schema level as it's based on Table Schema)

I think we would need to write new code to do that. The problem is that I couldn't find a good documentation in the goodtables format, only these pages:

roll commented 1 year ago

@amercader What I suggest is to store a resource desciptor according to Data Resource Standard. It has standard properties including resource.dialect (CSV/Table Dialect standard) and resource.schema (Table Schema standard). The only framework specific (custom) property will be resource.checklist (aka validation options)

Of course, if you think it's better to separate schema it will also work as it's also an interchangeable standard but you also need to store other resource metadata somewhere (dialect/etc)

roll commented 1 year ago

@aivuk https://github.com/frictionlessdata/framework/tree/v2

amercader commented 1 year ago

Thanks both for the input. I still think keeping just the Table Schema separate makes more sense to keep the whole implementation as is. So @aivuk let's keep the separate schema and validation_options fields so everything keeps working as now and piece together the resource descriptor needed by framework just before validation. IIUC the new validation options format can be something like {"dialect": {}, "checklist": {}"}, which will be combined with the schema to create the resource descriptor needed

roll commented 1 year ago

I agree it's of course the quickest way. In the end, it doesn't matter how an exact implementation stores Frictionless concepts internally.

In the future, I think it still will be good to rebase on more standards / framework@5 compatible terminology (e.g. dialect is a resource/file's property, not a validation option, and also you might need to store encoding/format/compression/etc making validation_options basically a resource descriptor de-facto).

@amercader I have a more high-level question. Have you ever thought about adopting (Tabular) Data Resource Standard to be an internal part of CKAN for resource/file metadata description? We started a similar discussion with Dan Feder from DCAN and, currently, they are working on supporting something like this. In Frictionless Application, any resource is also described by Data Resource descriptor.

amercader commented 1 year ago

@roll see https://github.com/frictionlessdata/ckanext-validation/discussions/72

aivuk commented 1 year ago

@amercader @roll I did update in my last commit the plugin to parse the validation options as:

{"dialect": { ...}, "checks": {...}}

I used "checks" instead of "checkslist" to follow the frictionless framework nomenclature. Options as "skip_errors" or "pick_errors" can also be passed. @roll do we have a documentation for how the Dialect descriptor must be written? I'm generating it using the framework as:

dialect = Dialect(header=True, header_rows=[1], comment_char='#')
dialect.add_control(formats.CsvControl(delimiter=';'))
dialect.to_descriptor()

Do we have the documentation for how to pass the checks parameters descriptors? I could find only an example here https://framework.frictionlessdata.io/docs/checks/table.html#using-declaratively. I think it would be a good idea to create a list in the framework documentation with all the possible checks and their parameters, and this can be generated automatically.

Currently in the CKAN Resource Validation Options form we don't give much information to the user in how to write the options: Screenshot from 2022-10-24 23-40-11 We just point to the github repository of ckanext-validation. I think it would be better to add a short text, maybe a button that would open a small dialog window with some help text and point the user to a better page, probably a link to the Frictionless framework documentation. Same thing for the Data Schema field.

roll commented 1 year ago

Hi @aivuk!

Can you please elaborate on the first question (how to pass)?

Regarding the checks docs it's for example here - https://framework.frictionlessdata.io/docs/checks/table.html#table-dimensions - and others in the menu under "Validation Checks". All of them have a "API Reference" section. Also, @shashigharti is currently working on adding more docs to checks/steps instead of "NOTE: add docs"

roll commented 1 year ago

TBH I don't fully understand a new approach. If there is still a field for CKAN end-users called "validation_options" - how they will be filling it if framework@5 doesn't even have this concept of validation options? I thought it was for internal representation, not user-facing.

aivuk commented 1 year ago

TBH I don't fully understand a new approach. If there is still a field for CKAN end-users called "validation_options" - how they will be filling it if framework@5 doesn't even have this concept of validation options? I thought it was for internal representation, not user-facing.

The ckanext-validation is running the function validate from frictionless framework. If an user wants to pass parameters to the validate as skip_errors or pick_errors, or customize the dialect, or pass parameters to the validation checks, s/he is going to do it passing a json in the "validation options" field.

Can you please elaborate on the first question (how to pass)?

I mean a simple list with all the checks options names as 'table-dimensions' and their parameters. The only list that I could find was for the baseline checks but there is no documentation for what each one does https://framework.frictionlessdata.io/docs/checks/baseline.html

aivuk commented 1 year ago

@roll Just an example, currently the user can put in the validation options field the json:

{
  "checks: [
    {"type": "table-dimensions", "minRows": 3},
    {"type": 'ascii-value'}
  ],
  "dialect": {
     "header": true,
     "headerRows": [1],
     "commentChar": "#",
     "csv": {
       "delimiter": ";"
     }
  }
}
amercader commented 1 year ago

@roll

TBH I don't fully understand a new approach. If there is still a field for CKAN end-users called "validation_options" - how they will be filling it if framework@5 doesn't even have this concept of validation options? I thought it was for internal representation, not user-facing.

We just need a way for users for inputing parameters for the validate() call. Right now we have this existing field which accepts a JSON object so it's easy to dump everything there rather than create separate checks, dialect, skip_errors etc fields in the form. The end goal should be to have a proper UI for tweaking all these settings, which I assume would be another frictionless component. When that's ready we'll refactor the form to use this and internally we'll settle on a way to pass the parameters to frictionless' validate() function

roll commented 1 year ago

@amercader That totally sounds good as an intermediate step. Just wanted to ensure that we're on the same page regarding the final vision (a UI component field instead of the validation_options field)

amercader commented 1 year ago

Great work @aivuk ! :rocket: