ckan / ckanext-validation

CKAN extension for validating Data Packages using Table Schema.
MIT License
28 stars 32 forks source link

Allow upload of invalid resource data #85

Open fulior opened 1 year ago

fulior commented 1 year ago

Currently ckanext-validation doesn't allow uploading resources which don’t pass validation. If the validation fails the extension tries to remove a failed resource file. But there are cases where CKAN should be allowed to store invalid data. In cases like asynchronous bulk upload or large tabular data compiled by multiple parties the validation report is important for data curators to know what’s going on in the system.

This could be solved by adding extra configuration option:

ckanext.validation.allow_invalid_data = False

This would let us to modify logic in ckanext/validation/logic.py (https://github.com/frictionlessdata/ckanext-validation/blob/1073c80dace453a404df3d5f1ac1ae86a88b5029/ckanext/validation/logic.py#L665) to preserve data file in case of validation error e.g:

allow_invalid_data = bool(t.config.get(
'ckanext.validation.allow_invalid_data'
))

if not report['valid'] and not allow_invalid_data:
...

I’ll gladly provide a PR for this if you think this is something useful. It’s a requirement of our users, so we’re currently using it in a forked version of ckanext-validation

ThrawnCA commented 1 year ago

I was under the impression that if only asynchronous validation is used, then the desired behaviour is already the case; you can upload an invalid resource and just get a report on the failures.

It's when you enable synchronous validation that uploads will be rejected.

fulior commented 1 year ago

You're right, I've missed this, I was testing v2.0.0 in synchronous mode. Wouldn't it make more sense to have the same behavior for both async and sync mode?

  1. Either a configurable option like ckanext.validation.allow_invalid_data_in_sync_mode.
  2. Or just enable invalid data upload in sync mode permanently, so it's consistent with the asynchronous mode?
ThrawnCA commented 1 year ago

Wouldn't it make more sense to have the same behavior for both async and sync mode?

I don't think so. They're not just performance differences, they actually give you different feedback. Synchronous validation can actually give you an error, whereas async just gives you a report.

If you have chosen not to validate resources until after upload is complete, then clearly you are okay with potentially invalid resources existing in your system.

If you have chosen to take the performance hit to validate potentially large resources on the spot, then clearly you care a lot about ensuring that they're valid, and it makes sense to deny invalid uploads.

It's like the difference between having metal detectors and X-raying luggage, vs just having security cameras and doing background checks. One has the capability and expectation to actually deny entry to those who fail, the other doesn't.

fulior commented 1 year ago

In our case we have users that are collaborating on the uploaded data. Some files can be incomplete (e.g. incomplete patient surveys etc.) and the gaps will be filled by other users. I understand this can be achieved with asynchronous mode, but I think this behavior should be available in both modes.

tomeksabala commented 1 year ago

Hello @ThrawnCA . I'm working together with @fulior on the issue. Thanks a lot for your valuable feedback and pointing out the differences between sync and async validation regarding data file deletion. This resolves one of our use cases. However, I still find the option to keep invalid data uploaded to the system with sync validation sensible. Two big reasons behind it (I wonder what you think?):

  1. synchronous validation is easier to work with on local dev env, and it would be nice to reproduce the production behaviour of keeping invalid data in the system with the validation report
  2. a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).

ckanext.validation.allow_invalid_data could be False by default and hence preserve old behaviour. It is not a big logic change and would make our configuration and workflow much more consistent.

ThrawnCA commented 1 year ago

synchronous validation is easier to work with on local dev env

How so? Wouldn't it be more important to replicate production behaviour? With a local database, I would expect that validation would complete in seconds at most, so there shouldn't be any particular difficulties in testing it.

a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).

Why couldn't this use asynchronous validation? All you need is the report to tell you when you've successfully fixed it, right? So asynchronous validation works fine.