Closed robredpath closed 5 years ago
List of things that are currently checked manually that we would like to automate:
@KDuerden anything to add to this list?
Multiple funder IDs (as these are mostly added in error)
Broken links and validity are the only ones that tend to change without warning, and they would be the priority for an automated status/feedback. In most cases the rest would normally be apparent at the point of adding or updating data on the Registry. So while it would be useful to have these checks automated, it is the timing in relation to the GN load that is really the issue. If I miss them at the point of adding the data to the Registry, they only surface once the load has happened, and by then it is too late to get them fixed without having to re-run the load. Essentially, the GN load process is the current automated check for all of the above.
Data protection would be difficult to automate (beyond what CoVE does with email addresses) as it involves me actually doing spot checks of the data.
Thanks!
I'd previously set up some automated testing in the test_registry repo: https://github.com/ThreeSixtyGiving/test_registry. I've added broken link checking to that. The test runs on Travis; here's what the output looks like: https://travis-ci.org/ThreeSixtyGiving/test_registry/jobs/329149072. It's not very easy to read, so I'll set up a summary spreadsheet.
Here's a summary spreadsheet: https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0 This should refresh each day. It currently covers whether downloads work (i.e. broken links), and whether the license is correct. Hopefully that's already useful for identifying some problems.
There are columns for whether the data converts, or is valid, but these are empty at the moment because some files are too big for Travis to deal with. I think a sensible first approach here would be to exclude the large files, and to check the rest.
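A minimal sketch of that exclusion step, assuming each registry entry reports a download size. The `download_size` field, the 50 MB limit, and the dict shape are all hypothetical here, not the registry's actual schema:

```python
# Hypothetical sketch: split datasets into those small enough to test on
# Travis and those to skip. Field names and the size limit are assumptions.

MAX_BYTES = 50 * 1024 * 1024  # assumed Travis-friendly limit

def partition_by_size(datasets, max_bytes=MAX_BYTES):
    """Split datasets into (testable, skipped) based on reported size."""
    testable, skipped = [], []
    for d in datasets:
        size = d.get("download_size")
        # Unknown sizes are kept; only files known to be too big are skipped.
        if size is not None and size > max_bytes:
            skipped.append(d)
        else:
            testable.append(d)
    return testable, skipped

datasets = [
    {"title": "Small grants file", "download_size": 120_000},
    {"title": "Huge grants file", "download_size": 900 * 1024 * 1024},
    {"title": "Unknown size", "download_size": None},
]
testable, skipped = partition_by_size(datasets)
print([d["title"] for d in testable])  # the small and unknown-size files
print([d["title"] for d in skipped])   # the huge file
```

Keeping files with unknown sizes in the testable set errs on the side of checking more, at the cost of an occasional oversized download.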
These tests also rely on converting the data:
This leaves "Appropriate hosting pages" -> we could check that these don't 404.
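For illustration, a hedged sketch of that "doesn't 404" check. The fetcher is injected as a parameter so the logic can run without network access (in a real test it would wrap an HTTP HEAD request); the URLs and statuses below are invented:

```python
# Sketch of a hosting-page liveness check. The fetch_status callable stands
# in for a real HTTP request; everything here is illustrative, not the
# actual test_registry implementation.

def find_broken(urls, fetch_status):
    """Return URLs whose hosting page responds with an error status."""
    return [url for url in urls if fetch_status(url) >= 400]

# Stub fetcher standing in for a real HTTP HEAD request.
fake_statuses = {
    "https://example.org/grants.json": 200,
    "https://example.org/moved/error.aspx": 404,
}

broken = find_broken(fake_statuses, lambda url: fake_statuses[url])
print(broken)  # ['https://example.org/moved/error.aspx']
```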
Excluding very large files makes sense as a compromise. This is already useful - I can see one file has an unexpected broken link!
Conversion and validation tests added to the table https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0
I needed an automated GitHub user to upload to GitHub gist, so I made an extra account - https://github.com/360bot
@Bjwebb I've been checking this daily & it has already proved useful for flagging an issue with a wandering file. Thank you!
@Bjwebb only seeing 20 datasets at the moment. Is this an easy fix?
@KDuerden @Bjwebb The Esmee link seems to be down - https://esmeefairbairn.org.uk/userfiles/Documents/JSON%20grants%20list/error.aspx - is this causing the error?
File is moved, so I've updated - thanks @BobHarper1. Just simple broken links shouldn't break it though?
Hmm, not quite sure what's going on here. Will take a look.
Still not sure what broke this, but running it again seems to have fixed it.
I think @KDuerden updating the link on the registry probably fixed it, maybe.
This continues to be a massive help! Laziness on my part, but could the access URL be pulled through into the report too? It would make it quicker to check what is going on with broken links. No worries if not.
Woodward and Zing aren't being picked up in this list - they are the last two when sorted alphabetically.
I think maybe useful to have separate issues in test_registry for these https://github.com/ThreeSixtyGiving/test_registry/issues/2 https://github.com/ThreeSixtyGiving/test_registry/issues/3
I've fixed those two issues, see https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0
I moved the Northern Rock file to Google Docs on Friday. It passes CoVE but it is not passing Valid on the list. Is this something in the file, or does the test need updating to handle Google Docs?
I ran this locally and can see that the converted json has in it:
```json
{
  "#": "About this sheet",
  "hashComments": "This sheet provides \"metadata\" about this dataset - useful information for users of this data. None of the data in this sheet is part of the 360Giving Standard and if necessary it can be removed prior to use."
},
{
  "#": "Publisher:",
  "hashComments": "Northern Rock Foundation"
},
{
  "#": "Date published:",
  "hashComments": "2016-05-26T00:00:00+00:00"
},
{
  "#": "Licence:",
  "hashComments": "Open Data Commons Public Domain Dedication and Licence 1.0"
},
{
  "#": "Terms of Use:",
  "hashComments": "This work is licensed under the Open Data Commons Public Domain Dedication and Licence 1.0. To view a copy of this license, visit http://www.opendefinition.org/licenses/odc-pddl. This means the data is freely accessible to anyone to be used and shared as they wish with no restrictions."
},
{
  "#": "Title:",
  "hashComments": "Northern Rock Foundation 360Giving data"
},
{
  "#": "Standard:",
  "hashComments": "360Giving Standard"
},
{
  "#": "Schema:",
  "hashComments": "http://standard.threesixtygiving.org/en/latest/_static/360-giving-schema.json"
},
{
  "#": "Contact:",
  "hashComments": "For queries about this data contact support@threesixtygiving.org"
},
{
  "#": "Period:",
  "hashComments": "10/03/1998 to 30/06/2014"
},
{
  "#": "Description:",
  "hashComments": "Grants awarded between 1998 and June 2016. Northern Rock Foundation grants data is hosted by 360Giving on behalf of the foundation. Northern Rock Foundation formally closed on 25 April 2016."
}
```
Sorry, closed by mistake! I meant to continue typing... so the hashComments aren't being ignored; I'll have a look at how that can be fixed.
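For illustration, a minimal sketch of what ignoring the hashComments rows could look like, assuming the converted JSON is a list of row dicts where metadata rows carry a `"#"` key (as in the sample above) while real grant rows do not. This is a hypothetical post-processing step, not the actual fix, which was to update the flatten-tool requirement:

```python
# Hypothetical sketch: drop "About this sheet" metadata rows before
# validation. Assumes metadata rows are dicts containing a "#" key.

def strip_comment_rows(rows):
    """Remove rows that are sheet metadata rather than grant data."""
    return [row for row in rows if "#" not in row]

rows = [
    {"#": "Publisher:", "hashComments": "Northern Rock Foundation"},
    {"id": "360G-NR-0001", "title": "Example grant", "amountAwarded": 5000},
]
print(strip_comment_rows(rows))
# [{'id': '360G-NR-0001', 'title': 'Example grant', 'amountAwarded': 5000}]
```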
Ah, this will be due to the datagetter's flatten-tool requirement being out of date: requirements.in, requirements.txt
@Bjwebb Ah! Ok, saves me fiddling around with file. Thanks!
@Bjwebb is the solution here to just bring datagetter forward to the latest version of flatten-tool? I'm happy to do that and test.
@robredpath That's right, thanks.
I've opened a PR for that. Although, from conversation with @Bjwebb last week, I don't think that's everything needed to be able to work with Google Sheets.
AFAIK, Google Sheets should work. My impression of the problem above is that it's due to hashComments, not the fact that the file is on Google Sheets.
The run last night resulted in 28 validation fails. Spot checks show they aren't failing CoVE. As it is the GN load tomorrow, can this test be re-run today, so that I can see the real situation? Thank you!
Hi @KDuerden - sorry about that, we introduced a bug with a recent update. I've backed out the changes and set the tests running again - I'll let you know when it's updated.
@KDuerden we've re-run the report and the latest one should be accurate!
This is, I believe, now fixed.
See internal issue https://opendataservices.plan.io/issues/12129