OpenDataScotland / the_od_bods

Collating open data from across Scotland

handle broken links in pipeline #193 #200

Closed · nutcracker22 closed this 2 years ago

nutcracker22 commented 2 years ago

Description

This solves issue #193 by handling errors from broken links during the run of pipeline.yml. Based on @JackGilmore's comment, I suspect there is a shorter way of handling the error, so please let me know what you think. Even major changes to the approach are no problem ;) So far, the main change is in processor.py, where error handling is introduced while requesting the URL for data collection. The changes in arcgis.py, usmart.py, ckan.py and dcat.py are minimal: they only introduce a check of whether the data returned from processor.py is "NULL".

Motivation and Context

See #193

With the introduced changes, processor.py tries to execute the get_json() function (line 41) and, if no error occurs, continues running the code as it did previously.

If an error occurs (the link is not accessible), an error message is printed to the terminal and an error string is appended to a log.json in ../opendata.scot_pipeline/. In that case, the get_json() function returns "NULL", which causes the URL to be skipped in the file that called processor.py (arcgis.py, usmart.py, ckan.py, or dcat.py).
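For reviewers skimming this, a minimal sketch of the approach described above; the exact signature, log entry format, and exception classes in processor.py may differ:

```python
import json
import requests

# Path from the PR description; for now the log file must exist before the run
LOG_PATH = "../opendata.scot_pipeline/log.json"

def get_json(url):
    """Fetch JSON from url; log the failure and return "NULL" on a broken link."""
    try:
        response = requests.get(url)
        response.raise_for_status()  # surface HTTP errors (404, 500, ...) as exceptions
        return response.json()
    except requests.exceptions.RequestException as err:
        print(f"Could not access {url}: {err}")
        # the log entry layout here is assumed for illustration
        with open(LOG_PATH, "a") as log_file:
            log_file.write(json.dumps({"url": url, "error": str(err)}) + "\n")
        return "NULL"

# Caller side (arcgis.py, usmart.py, ckan.py, dcat.py), roughly:
#
#     data = processor.get_json(url)
#     if data == "NULL":
#         continue  # skip this source and move on to the next URL
```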

Until the pipeline.yml is updated, the log file has to be created manually for testing, since processor.py depends on the file.

If this approach is approved, a second PR with changes to pipeline.yml will need to be created, adding the following functionality to pipeline.yml:

@JackGilmore: since I do not feel comfortable messing around with the pipeline, could you add this functionality if this PR is fine? And if you need different content in log.json, just let me know.

fixes #193, fixes #195


KarenJewell commented 2 years ago

The approach sounds solid to me 👍 but sadly I can only get round to testing it out tomorrow evening. Superb work @nutcracker22 and thanks for such a speedy response!

JackGilmore commented 2 years ago

Thanks @nutcracker22. Not had a chance to look at this yet but an idea had occurred to me regarding the log.json. GitHub Actions supports markdown output to generate a job summary to display once the job is finished. Source: https://github.blog/2022-05-09-supercharging-github-actions-with-job-summaries/

I wonder if it's worth implementing that as a log output? I'd had aspirations to use it as one of the sync summary outputs (e.g. x datasets updated, y added and z deleted) but it sits on my ever-growing pile of to-dos! Even if you could just dump the logs to a .md file, I can then take it from there to output it in the pipeline. I'm also happy to add the functionality to the pipeline to notify on failure :)
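For context, the job-summary feature in the linked post works by appending Markdown to the file pointed at by the GITHUB_STEP_SUMMARY environment variable inside a job step. A minimal sketch of dumping a broken-links log to it from Python (the function name, argument shape, and table layout are hypothetical, not from this PR):

```python
import os

def write_job_summary(broken_links):
    """Append a Markdown table of broken links to the Actions job summary.

    GITHUB_STEP_SUMMARY is set by the Actions runner for each step;
    anything appended to that file is rendered on the run's summary page.
    """
    with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as summary:
        summary.write("## Broken links\n\n")
        summary.write("| URL | Error |\n| --- | --- |\n")
        for link in broken_links:
            summary.write(f"| {link['url']} | {link['error']} |\n")
```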

nutcracker22 commented 2 years ago

@JackGilmore: I added the logging to a markdown file, but kept the logging to JSON for the time being. The heading and the top of the table in the log.md file have to be created by the pipeline; I've already updated main.sh to do these steps, which might help you with pipeline.yml.
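A minimal sketch of that split, with the pipeline writing the heading and table header once and processor.py appending one row per failure (names, paths, and layout assumed for illustration, not taken verbatim from this PR):

```python
# pipeline side: create log.md with the heading and table header once per run
def init_markdown_log(path="../opendata.scot_pipeline/log.md"):
    with open(path, "w") as md:
        md.write("# Broken links\n\n| URL | Error |\n| --- | --- |\n")

# processor.py side: append one table row per broken link
def log_to_markdown(url, error, path="../opendata.scot_pipeline/log.md"):
    with open(path, "a") as md:
        md.write(f"| {url} | {error} |\n")
```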

If you'd like me to change things, just let me know.

KarenJewell commented 2 years ago

@nutcracker22 this works great, thanks! I've tested this against fake links for arcgis, dcat, usmart and ckan - all successful.

Three things:

  1. Could you remove the block comments from main.sh please?
  2. I think these logs (in main.sh) should write to the local the_od_bods repo rather than the opendata.scot_pipeline repo. The logs for the pipeline should come from the GitHub Actions executed pipeline itself (prod), not a local main.sh execution (dev). Happy for @JackGilmore to weigh in here.
  3. More a consideration than a request: have we given any thought to how we might handle dead links in the webscraper scripts? I realise this is more a pipeline issue than the_od_bods, but just flagging it.
nutcracker22 commented 2 years ago

@KarenJewell:

re 1) I had already removed the block comment and committed, but not pushed ;)

re 2) I just changed it, but then realised that processor.py will append logs to the pipeline log file. Right now I don't know how to make processor.py aware of whether it's being run by main.sh or by pipeline.yml. One way to solve this might be to have processor.py and main.sh handle log.json in the the_od_bods repo/folder. The log.md would only be deleted by main.sh, with no new .md file created; processor.py would still create this file in ../opendata.scot_pipeline/, but since the first lines are missing, it will be clear that it was created during a run of main.sh. If you have another way, just let me know :)

re 3) I also recognised while coding this solution that the same problem exists for the scrapers. My idea was to implement the same code in all scrapers, meaning they would write to the same log files.
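For what it's worth, one possible way to tell the two contexts apart (illustrative only, not something this PR implements): the Actions runner always sets the GITHUB_ACTIONS environment variable to "true", so a local main.sh run can be detected by its absence.

```python
import os

def running_in_pipeline():
    # GITHUB_ACTIONS is set to "true" by the Actions runner, so its
    # absence indicates a local main.sh (dev) run
    return os.environ.get("GITHUB_ACTIONS") == "true"

# hypothetical use: pick the log location based on the execution context
LOG_DIR = "../opendata.scot_pipeline/" if running_in_pipeline() else "./"
```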

KarenJewell commented 2 years ago

Yeah, I totally get what you're saying there @nutcracker22, but I'm happy with the latest set of changes you've made to main.sh now that it no longer writes to the specified pipeline repo.

I've tested again and I'm happy to merge the changes. There are outstanding changes needed for the pipeline, but we'll catch those back in the original issue #193. This PR is not a breaking change regardless of the pipeline changes.