OpenDataScotland / jkan

A lightweight, backend-free open data portal, powered by Jekyll, based on the JKAN project
https://opendata.scot
MIT License

Script a check for platform health, e.g. org mismatches or datasets that are 404ing #32

Open · KarenJewell opened this issue 2 years ago

JackGilmore commented 2 years ago

@KarenJewell Do you think we still need this? We run the alive.py script daily to make sure we don't have any disappearing open data portals, and if individual datasets are removed from their portals then they are removed from opendata.scot on the Friday sync.
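For illustration, here is a minimal sketch of the kind of daily liveness check described above. This is not the actual alive.py: the real script's structure, portal list, and reporting almost certainly differ, and the URLs below are placeholders.

```python
# Illustrative sketch only: a minimal daily portal liveness check in the
# spirit of alive.py. Portal URLs here are placeholders, not the real list.
import requests

PORTALS = [
    "https://portal-one.example.org",
    "https://portal-two.example.org",
]

def check_portals(urls, timeout=10):
    """Return (url, reason) pairs for portals that fail to respond with 2xx."""
    failures = []
    for url in urls:
        try:
            response = requests.head(url, timeout=timeout, allow_redirects=True)
            if not response.ok:
                failures.append((url, f"HTTP {response.status_code}"))
        except requests.RequestException as exc:
            failures.append((url, str(exc)))
    return failures

if __name__ == "__main__":
    for url, reason in check_portals(PORTALS):
        print(f"DOWN: {url} ({reason})")
```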

The only other work I can see us needing to do here would be the org mismatches, but I'm not sure what that problem entails as I can't remember encountering it.

If we close this, then I'd also consider closing OpenDataScotland/jkan#29.

We do now have https://opendata.scot/analytics/platform-health/ for organisations at least, and I can expand this report to also cover how well populated each of the fields in a dataset is. If we're to do that, though, I would propose opening another issue.
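A sketch of what that proposed field-population report might look like. It assumes datasets are Markdown files with YAML front matter under `_datasets/` (as in stock JKAN) and uses the python-frontmatter package; both are assumptions for illustration, not confirmed project choices.

```python
# Hedged sketch of a field-population report: count how often each
# front-matter field is filled in across JKAN dataset files.
from collections import Counter
from pathlib import Path

import frontmatter  # pip install python-frontmatter

def field_population(dataset_dir="_datasets"):
    """Return the fraction of datasets that populate each front-matter field."""
    counts = Counter()
    total = 0
    for path in Path(dataset_dir).glob("*.md"):
        metadata = frontmatter.load(path).metadata
        total += 1
        for field, value in metadata.items():
            if value not in (None, "", []):
                counts[field] += 1
    return {field: n / total for field, n in counts.items()} if total else {}

if __name__ == "__main__":
    # Print the least-populated fields first.
    for field, ratio in sorted(field_population().items(), key=lambda kv: kv[1]):
        print(f"{field}: {ratio:.0%} populated")
```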

KarenJewell commented 2 years ago

Related conversation copied from Slack (17 Jun 2022) https://opendatascotland.slack.com/archives/C02HEHDL8AY/p1655492751339199:

[Karen] Also, open to thoughts on this: https://github.com/OpenDataScotland/jkan/issues/32 Jack Gilmore has raised a good question in it about redundant links. Now technically, if the whole publisher's portal is down or the dataset is removed, we don't list it anymore, so opportunities for dead links are few. BUT it is possible that a listing is just neglected, or migrated without the old link being removed, so the underlying file location changes and we end up with some dead links. We have no way of telling unless we call each and every file, but that takes a long time (upwards of 20 mins on my last check), so: not worth it? Do we live with the dead links? Or is there a better way than my wee looping requests function?

[Jack] Just for clarity, what are you looking to check for dead links? The original dataset link, the actual individual resources under the dataset, or both? GitHub Actions could help solve some of the time/resource issues here, I think. You can run jobs in parallel, so we could even try something super simple like splitting the link list in half and giving half to each job. I'm also assuming Python supports some sort of concurrency for sending HTTP requests?
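As a sketch of both suggestions: the link list could be sharded across parallel CI jobs, and each shard checked with a thread pool (Python's concurrent.futures covers the HTTP concurrency Jack is asking about). The SHARD/TOTAL_SHARDS environment variables, the check_url helper, and the placeholder URL list are illustrative, not existing project code.

```python
# Hedged sketch: shard the link list so parallel GitHub Actions jobs each
# take a slice, then check each slice concurrently with a thread pool.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def check_url(url, timeout=10):
    """Return (url, reason) if the link looks dead, else None."""
    try:
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        if response.status_code >= 400:
            return (url, f"HTTP {response.status_code}")
    except requests.RequestException as exc:
        return (url, str(exc))
    return None

def check_links(urls, workers=20):
    """Check URLs concurrently; return only the failures."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [result for result in pool.map(check_url, urls) if result]

if __name__ == "__main__":
    # Each CI job would set SHARD (0-based) and TOTAL_SHARDS, e.g. via a
    # job matrix; defaults below mean "check everything in one job".
    shard = int(os.environ.get("SHARD", 0))
    total = int(os.environ.get("TOTAL_SHARDS", 1))
    all_urls = ["https://example.org/resource.csv"]  # placeholder list
    for url, reason in check_links(all_urls[shard::total]):
        print(f"DEAD: {url} ({reason})")
```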

[Andrew] Could you maybe have a facility for someone to report a link as dead, and then add that link to a list of known dead links that gets re-checked, so that after a certain number of checks you can tell whether it's just down or gone for good? That would save having to check every link.
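A minimal sketch of that reporting facility, assuming reported links live in a small JSON file mapping URL to consecutive-failure count, and a link counts as gone after three failed checks in a row; the filename and threshold are assumptions for illustration.

```python
# Hedged sketch of a report-a-dead-link registry: only user-reported links
# get re-checked, and a link is declared gone after N consecutive failures.
import json
from pathlib import Path

import requests

REGISTRY = Path("reported_links.json")  # assumed filename
FAILURE_THRESHOLD = 3  # consecutive failures before declaring "gone"

def recheck_reported_links():
    """Re-check reported links, dropping recovered ones and flagging dead ones."""
    reports = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    for url, failures in list(reports.items()):
        try:
            alive = requests.head(url, timeout=10, allow_redirects=True).ok
        except requests.RequestException:
            alive = False
        if alive:
            del reports[url]  # recovered: drop it from the watch list
        else:
            reports[url] = failures + 1
            if reports[url] >= FAILURE_THRESHOLD:
                print(f"GONE: {url} failed {reports[url]} checks in a row")
    REGISTRY.write_text(json.dumps(reports, indent=2))

if __name__ == "__main__":
    recheck_reported_links()
```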

[Karen] It's the resource links (the asset links).