ckan / ckanext-validation

CKAN extension for validating Data Packages using Table Schema.
MIT License
30 stars 33 forks source link

404 error when uploading a file that fails postgres rules #94

Open KatiRG opened 1 year ago

KatiRG commented 1 year ago

There is a bug in this extension when creating a new resource that contains an invalid structure (e.g. duplicate column names).

After clicking on "Add", the user is directed to a non-existent page (dataset/dataset-page-name/resource/new) with a 404 error:

Screenshot 2023-11-02 at 14 21 04

What should happen

After clicking "Add", if the validation status is not "success":

  1. the user should be kept on the add resource page
  2. the validation report should be shown at the top of the page, in the same way that it is displayed for updates of existing resources
  3. only after the user uploads a valid file should the application redirect to the resource page
JVickery-TBS commented 1 year ago

@KatiRG This is when you are using ckanext.validation.run_on_create_sync = True and ckanext.validation.run_on_update_sync = True?

To my understanding, I thought that when using the validation extension in sync mode, it runs the validation on resource creates and updates. And if that fails validation, then it will not create/update the Resource.

What version of CKAN are you running?

KatiRG commented 1 year ago

@JVickery-TBS We have the following settings:

ckanext.validation.run_on_create_sync = True
ckanext.validation.run_on_update_sync = True
ckanext.validation.run_on_update_async = False
ckanext.validation.formats = csv
ckanext.validation.show_badges_in_listings = True

and we are using CKAN 2.9.7. ckanext.validation.run_on_update_sync needs to be true otherwise an uploaded file will not be validated. But xloader is pushing files that don't pass validation for some reason.

KatiRG commented 1 year ago

Update: it errors out because of a call to delete the new resource here:

https://github.com/frictionlessdata/ckanext-validation/blob/master/ckanext/validation/logic.py#L679

Creating a new resource with an invalid CSV file means there is no resource in the database, therefore the call to delete it (t.get_action(u'resource_delete')) will fail.

JVickery-TBS commented 1 year ago

@KatiRG sorry for the sporadic replies! I am able to recreate this locally. So am just going to debug through it a bit to see what the issue is.

But it looks like the Resource gets created for me, but then yes, I also get hit with that 404 but the web address is still /resource/new which generally means that something is returning the Template/render method in a resource create view override instead of returning an abort or redirect.

Hopefully will have more for you soon!

JVickery-TBS commented 1 year ago

Okay I found the issue:

  1. In resource_validation_run, it calls resource_show;
  2. In resource_show action method, it loops through the package to confirm that the resource belongs to it (https://github.com/ckan/ckan/blob/2.9/ckan/logic/action/get.py#L1097), here it fails and raises a NotFound;
  3. But at this point we are in the Resource Create View (or Update View) which catch NotFound exceptions (https://github.com/ckan/ckan/blob/2.9/ckan/views/resource.py#L226), where is aborts the view. And in this case, the error message just happens to be misleading

So I am now just debugging more here, and seeing if it would be safe to catch the NotFound exception in resource_validation_run to fix this here. Im sure we can fix this some how in the resource_validation_run method

KatiRG commented 1 year ago

Thanks @JVickery-TBS !

I found that removing the delete action is sufficient to resolve this error. The check in logic.py already checks if the report if valid: https://github.com/KatiRG/ckanext-validation/blob/bug_94/ckanext/validation/logic.py#L677

If the resource is not valid (as checked by resource_validation_run), then

Please check if this solution also works for you! I can make a PR if you like.

KatiRG commented 1 year ago

Screenshot of new resource create, when an invalid file has been uploaded. The resource does not get pushed to the database:

Screenshot 2023-11-16 at 11 05 12

KatiRG commented 1 year ago

There is one issue left which is to not display the invalid resource in the dataset page: Screenshot 2023-11-16 at 11 16 22

JVickery-TBS commented 1 year ago

@KatiRG The error that you original got, and that I was able to replicate was coming from Xloader. Let me know you are running with xloader plugin installed here.

So, yes, the validation failure would delete the resource, but then Xloader uses IResourceUrlChange which is a hook that happens on before_commit. In Xloader's notify method, it does the resource_show and of course at this point the resource has been deleted by Xloader. But then the raised exception happens and then Validation extension is halted and the delete resource session commit does not happen and the Resource gets created anyways.

I personally like to use IDomainObjectModification implementation instead of IResourceUrlChange because you have more parameters in the IDomainObjectModification notify method. And, it automatically catches exceptions, logs them, and passes them because CKAN knows that extensions will override action methods and session commits.

I already have a PR on the xloader repo which would actually solve this problem: https://github.com/ckan/ckanext-xloader/pull/196/files

As for the functionality of the Validation extension and how synchronis validation works. It is to my understanding that Resource do NOT get created or updated if the validation fails (which happens on form submission or API POST/PATCH). Thus, the Resource would not show up on the Dataset page.

If you want to have the Resources get created even if they fail validation, then you would want to use async. This is my understanding of how the developers of the Validation extension wanted it to work.

That being said, there could be another option here to make Resources that do not pass validation become Draft instead of just deleting them.

JVickery-TBS commented 1 year ago

https://github.com/ckan/ckanext-xloader/pull/198 <- PR for quicker acceptance into Xloader here.

https://github.com/qld-gov-au/ckanext-xloader/pull/69 <- and the Queensland one, if you use that fork instead.

JVickery-TBS commented 1 year ago

@KatiRG ah sorry, I understand your "There is one issue left which is to not display the invalid resource in the dataset page" comment. Sorry!

Yes, that would have probably been caused from the removal of the delete. But that would be solved by keeping the delete, and fixing the issue in Xloader. (I shall go see if datapusher also has this issue)

KatiRG commented 1 year ago

@JVickery-TBS Keeping the delete doesn't resolve that last issue. With the delete in place, the user will be directed to a 404 page after adding the file: Screenshot 2023-11-16 at 11 42 05

but then they can navigate via the menu back to the dataset page, and the resource will be listed: Screenshot 2023-11-16 at 11 58 32

JVickery-TBS commented 1 year ago

@KatiRG Yeah that is correct. That is because of the order of execution between Validation and Xloader extensions. So what happens is:

  1. Submit Resource form;
  2. Gets validated, failing validation;
  3. A DELETE for the Resource is added to the session;
  4. Gets submitted to Xloader, fails because the Resource has been deleted from the current queue, throws an exception;
  5. The Resource Create View catches exception and prints 404 page, stopping the remaining code execution which includes the session commit call in the Validation extension, so the DELETE resource never happens.

The Xloader extension needs to handle the exception of the Resource not existing during the submission to Xloader. Currently, it does not and really messes things up.

https://github.com/ckan/ckan/blob/master/ckan/model/modification.py#L67 <- there is even a comment in this model in the exception handling:

# Don't reraise other exceptions since they are generally of
# secondary importance so shouldn't disrupt the commit.

So the issue, is that Xloader is disrupting the commit here.

JVickery-TBS commented 1 year ago

@KatiRG if you are using your own fork of Xloader, you can try to cherry pick from my commit PR here: https://github.com/ckan/ckanext-xloader/pull/198

Otherwise, I don't think it should take too long for them to accept this PR. I have been doing a fair amount for Xloader this past week.

KatiRG commented 1 year ago

Thanks @JVickery-TBS , I will review!

I think Ian Ward has had a similar issue and got around it by disabling xloader's normal loading (https://github.com/open-data/ckanext-xloader/blob/567584b1fbb82987da662e6fde8fa4054d204abf/ckanext/xloader/plugin.py#L110-L112) and then specifying in ckanext-validation to only call xloader if the validation is a success: https://github.com/open-data/ckanext-validation/blob/ee70c9b3a1606c26fd3289a9ec12c32cb5800de7/ckanext/validation/jobs.py#L108C38-L115

JVickery-TBS commented 1 year ago

Yeah, we are using async for Validation, so the Resources get created no matter what so we never came across this Xloader error.

And then, yeah we disabled the default xloader submission code entirely, and then we manually call that xloader submit action in the validation extension.

I have allotted some time for myself to make this into an upstream contribution on the Xloader and Datapusher extensions. Something like ckanext.xloader.require_validation = True and then check validation statuses before submitting to Xloader. Should be nice for everyone to have, and not very specific code in our two forks haha

KatiRG commented 1 year ago

Update: I have wrapped the resource_delete in a try and this seems to have taken care of the original bug reported in this issue and the extra issue of invalid, non-existent resources being listed in the dataset page.

if new_resource:
        try:
            # Delete resource
            t.get_action(u'resource_delete')(
                {u'ignore_auth': True, 'user': None},
                {u'id': resource_id}
            )
        except NoResultFound:
            raise t.ObjectNotFound(
            'This resource does not exist in the datastore')