adinuca closed this issue 6 years ago
Hi @EricSoroos, could you please take a look at this issue?
Logs can be found here.
Looking into this a bit -- the error is that ckanext-dcat (and its dependency rdflib) strictly expects the URL field to be a valid URL and doesn't trap the error or skip the value when it's invalid. We definitely have metadata that's not a valid URL, so the combination of that metadata not conforming to the field definition and the strict definition of the formats causes the error.
3 options to fix:
1. Skip the links if the URL isn't valid. -- The links are done in three places in the templates, and we have access to the URL, but the helper is_url is ... not helpful, as it doesn't check for invalid URLs. We could add a helper to dcat, or just check for a couple of the likely invalid characters that we hit, like space. Easy to do as a hack, a little more involved to do it cleanly.
2. Trigger a different, verbose error on the actual link. This is pretty easy, and will help get those URLs out of the search engines. We should probably trigger a 4xx series error, but I don't see a good one offhand.
3. Patch rdflib to skip the URL if it's invalid. This is a bigger job, especially to do it in a way that makes these still be valid n3/turtle files.
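For the first option, a stricter check than is_url could look roughly like the sketch below. This is only an illustration: the name is_valid_url and the exact set of rejected characters are assumptions, not something that exists in ckanext-dcat or CKAN, and it is written for Python 3.

```python
from urllib.parse import urlparse

# Assumed set of characters that should not appear unescaped in a URL.
# Space is the one we actually hit; the rest are included for illustration.
INVALID_URL_CHARS = set('<>" {}|\\^`')


def is_valid_url(url):
    """Hypothetical helper: True if url looks like a usable absolute http(s) URL."""
    if not isinstance(url, str) or not url:
        return False
    # Reject anything containing a character from the assumed invalid set.
    if any(ch in INVALID_URL_CHARS for ch in url):
        return False
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs.
        return False
    # Require an http(s) scheme and a host part.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

The templates could then emit the .ttl and .n3 links only when such a check passes for the dataset's URL fields, which is the "skip" behaviour described in option 1.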
sample possible error response:
I think the response can be shorter. Something like : "Format not supported due to invalid URL".
It would be good if you could also catch the exception and log a message that tells you exactly what the issue is, instead of the long stack-trace.
That error message is 90% url. That response is essentially catching the exception and returning something useful to the browser that will explain the situation and prevent a crawler from retaining it.
I don't think we need to log it, since we know exactly what's causing it and can find all cases of this with a sql query.
Ok @EricSoroos, my main reason for the above message was to not have so many error logs that don't help. I agree we can just ignore the exception and return a proper response to the user.
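A rough sketch of the "catch it and return a short response" approach discussed above, kept independent of CKAN and rdflib so it runs on its own. Everything named here is a stand-in: InvalidURLError represents whatever the serializer actually raises, render_rdf is a hypothetical wrapper, and the status code is a placeholder since no specific 4xx code was settled on.

```python
# Placeholder only: the thread did not settle on a specific 4xx code.
ERROR_STATUS = 400
ERROR_MESSAGE = "Format not supported due to invalid URL"


class InvalidURLError(ValueError):
    """Stand-in for the error raised when a URL cannot be serialized."""


def render_rdf(serialize, fmt):
    """Call serialize(fmt) and return a (status, body) pair.

    serialize is any callable that produces the RDF text for the
    requested format (for example "ttl" or "n3") or raises InvalidURLError.
    """
    try:
        body = serialize(fmt)
    except InvalidURLError:
        # Return a short, explicit error instead of letting the
        # exception surface as a long stack trace in the logs.
        return ERROR_STATUS, ERROR_MESSAGE
    return 200, body
```

In the real controller the except clause would need to match the exception the serializer really raises, which this sketch does not attempt to identify.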
Hi @EricSoroos, do you have any update on this? There have been a few emails regarding errors generated by these URLs.
Hi @EricSoroos, any update on this?
Hi @deirdrelee, do you know when this will get done? As with #249, it is hard to spot real problems in the logs because of these error logs generated by this issue.
I've pushed a fix for this to staging
Thank you!
This has been fixed and deployed to production by @EricSoroos. Thank you!
Why
There are lots of errors reported because the turtle format and the notation3 format cannot be generated for datasets when the URL is not valid.
What
Notes
The URLs for the 2 formats are available in the source code of the dataset page (e.g. https://www.resourcedata.org/dataset/33a07bf8-35f4-45be-a951-b61aed8287ac).
Examples: https://www.resourcedata.org/dataset/33a07bf8-35f4-45be-a951-b61aed8287ac https://www.resourcedata.org/dataset/33a07bf8-35f4-45be-a951-b61aed8287ac.ttl https://www.resourcedata.org/dataset/33a07bf8-35f4-45be-a951-b61aed8287ac.n3
https://www.resourcedata.org/dataset/f4d3130b-4557-47fb-b609-6b0080b05025 https://www.resourcedata.org/dataset/f4d3130b-4557-47fb-b609-6b0080b05025.ttl https://www.resourcedata.org/dataset/f4d3130b-4557-47fb-b609-6b0080b05025.n3
https://www.resourcedata.org/dataset/7bbcb65a-653c-42ea-acb0-45943630bbef https://www.resourcedata.org/dataset/7bbcb65a-653c-42ea-acb0-45943630bbef.ttl https://www.resourcedata.org/dataset/7bbcb65a-653c-42ea-acb0-45943630bbef.n3
https://www.resourcedata.org/dataset/28350801-8f55-4155-81ca-874b94b0809d https://www.resourcedata.org/dataset/28350801-8f55-4155-81ca-874b94b0809d.ttl https://www.resourcedata.org/dataset/28350801-8f55-4155-81ca-874b94b0809d.n3