ThreeSixtyGiving / grantnav

A web-based search tool for data in the 360Giving data format.
http://grantnav.threesixtygiving.org/

Prepare a helper script for GN data loading #421

Closed: robredpath closed this issue 5 years ago

robredpath commented 6 years ago

See internal issue https://opendataservices.plan.io/issues/12129

Bjwebb commented 6 years ago

List of things that are currently checked manually that we would like to automate:

@KDuerden anything to add to this list?

KDuerden commented 6 years ago

Multiple funder IDs (as mostly done in error)

Broken links and validity are the only ones that tend to change without warning, so they would be the priority for automated status/feedback. In most cases the rest would normally be apparent at the point of adding or updating data on the Registry. So while it would be useful to have these checks automated, it is the timing in relation to the GN load that is really the issue: if I miss them at the point of adding the data to the Registry, they only surface once the load has happened, and by then it is too late to get them fixed without having to re-run the load. Essentially, the GN load process is the current automated check for all of the above.

Data protection would be difficult to automate (beyond what CoVE does with email addresses) as it involves me actually doing spot checks of the data.

Bjwebb commented 6 years ago

Thanks!

I'd previously set up some automated testing in the test_registry repo: https://github.com/ThreeSixtyGiving/test_registry. I've added broken link checking to that. The test runs on Travis; here's what the output looks like: https://travis-ci.org/ThreeSixtyGiving/test_registry/jobs/329149072. It's not very easy to read, so I'll set up a summary spreadsheet.
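A check along these lines only takes a few lines of Python. This is a minimal sketch rather than the actual test_registry code; the registry endpoint and field names are assumptions:

    import requests

    REGISTRY_URL = "http://data.threesixtygiving.org/data.json"  # assumed endpoint

    def check_url(url, timeout=30):
        """Return (ok, status) for a URL, treating any non-2xx response as broken."""
        try:
            # Some servers reject HEAD, so fall back to a streamed GET
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                resp = requests.get(url, stream=True, timeout=timeout)
            return resp.ok, resp.status_code
        except requests.RequestException as exc:
            return False, str(exc)

    for dataset in requests.get(REGISTRY_URL, timeout=30).json():
        ok, status = check_url(dataset["distribution"][0]["downloadURL"])  # field names assumed
        print("OK " if ok else "BROKEN", status, dataset.get("title", ""))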

Bjwebb commented 6 years ago

Here's a summary spreadsheet: https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0. It should refresh each day. It currently covers whether downloads work (i.e. broken links) and whether the license is correct. Hopefully that's already useful for identifying some problems.
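The licence check could be as simple as comparing each dataset's declared licence against an allow-list. A sketch only; the set of accepted licences here is illustrative, not the definitive 360Giving list, and the "license" field name is assumed:

    # Illustrative allow-list; not the definitive set 360Giving accepts
    ACCEPTED_LICENCES = {
        "https://creativecommons.org/licenses/by/4.0/",
        "https://creativecommons.org/publicdomain/zero/1.0/",
        "http://www.opendefinition.org/licenses/odc-pddl",
    }

    def licence_ok(dataset):
        """True if the dataset's declared licence URL is on the allow-list."""
        return dataset.get("license") in ACCEPTED_LICENCES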

There are columns for whether data converts and whether it is valid, but these are empty at the moment because some files are too big for Travis to deal with. I think a sensible first approach would be to exclude the large files and check the rest.
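Excluding the large files could use a quick Content-Length check before downloading anything. A sketch; the 50 MB threshold is an arbitrary example, not a documented Travis limit:

    import requests

    MAX_BYTES = 50 * 1024 * 1024  # arbitrary example threshold

    def small_enough(url):
        """Use the reported Content-Length to skip oversized files before downloading;
        servers that omit the header get the benefit of the doubt."""
        resp = requests.head(url, allow_redirects=True, timeout=30)
        size = resp.headers.get("Content-Length")
        return size is None or int(size) <= MAX_BYTES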

These tests also rely on converting the data:

This leaves "Appropriate hosting pages": we could check that these don't 404.
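The same kind of status check would cover hosting pages, just flagging 404s rather than any failure. A sketch, with the field path assumed:

    import requests

    def hosting_page_ok(dataset):
        """Flag hosting pages that return 404; anything else passes."""
        resp = requests.get(dataset["publisher"]["website"], timeout=30)  # field path assumed
        return resp.status_code != 404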

KDuerden commented 6 years ago

Excluding very large files makes sense as a compromise. This is already useful; I can see one file has an unexpected broken link!

Bjwebb commented 6 years ago

Conversion and validation tests added to the table https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0

Bjwebb commented 6 years ago

I needed an automated GitHub user to upload to GitHub gist, so I made an extra account - https://github.com/360bot
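For reference, posting a summary to a gist from a bot account is a single authenticated call to the GitHub Gists API. A sketch; the GITHUB_TOKEN environment variable and the filename are placeholders, not the actual test_registry setup:

    import os
    import requests

    def upload_gist(filename, content, description="test_registry summary"):
        """Create a public gist via the GitHub API, authenticated as the bot account."""
        resp = requests.post(
            "https://api.github.com/gists",
            headers={"Authorization": "token " + os.environ["GITHUB_TOKEN"]},
            json={
                "description": description,
                "public": True,
                "files": {filename: {"content": content}},
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["html_url"]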

KDuerden commented 6 years ago

@Bjwebb I've been checking this daily & it has already proved useful for flagging an issue with a wandering file. Thank you!

KDuerden commented 6 years ago

@Bjwebb I'm only seeing 20 datasets at the moment. Is this an easy fix?

BobHarper1 commented 6 years ago

@KDuerden @Bjwebb Esmee seems to be down (https://esmeefairbairn.org.uk/userfiles/Documents/JSON%20grants%20list/error.aspx). Is that causing the error?

KDuerden commented 6 years ago

The file has moved, so I've updated the link. Thanks @BobHarper1. Simple broken links shouldn't break it though, should they?

Bjwebb commented 6 years ago

Hmm, not quite sure what's going on here. Will take a look.

Bjwebb commented 6 years ago

Still not sure what broke this, but running it again seems to have fixed it.

BobHarper1 commented 6 years ago

I think @KDuerden updating the link on the registry probably fixed it, maybe.

KDuerden commented 6 years ago

This continues to be a massive help! Laziness on my part, but could the access URL be pulled through into the report too? It would make it quicker to check what is going on with broken links. No worries if not.

KDuerden commented 6 years ago

Woodward and Zing aren't being picked up in this list - they are the last two when sorted alphabetically.

Bjwebb commented 6 years ago

I think it's maybe useful to have separate issues in test_registry for these: https://github.com/ThreeSixtyGiving/test_registry/issues/2 and https://github.com/ThreeSixtyGiving/test_registry/issues/3

Bjwebb commented 6 years ago

I've fixed those two issues, see https://docs.google.com/spreadsheets/d/1iRH0N07Fi-XM6HcZLSR688EiPA4EQGc5wSP1hJIx3L4/edit#gid=0

KDuerden commented 6 years ago

I moved the Northern Rock file to Google Docs on Friday. It passes CoVE but it is not passing Valid on the list. Is this something in the file, or does the test need updating to handle Google Docs?

BobHarper1 commented 6 years ago

I ran this locally and can see that the converted JSON contains:

        {
            "#": "About this sheet",
            "hashComments": "This sheet provides \"metadata\" about this dataset - useful information for users of this data. None of the data in this sheet is part of the 360Giving Standard and if necessary it can be removed prior to use."
        },
        {
            "#": "Publisher:",
            "hashComments": "Northern Rock Foundation"
        },
        {
            "#": "Date published:",
            "hashComments": "2016-05-26T00:00:00+00:00"
        },
        {
            "#": "Licence:",
            "hashComments": "Open Data Commons Public Domain Dedication and Licence 1.0"
        },
        {
            "#": "Terms of Use:",
            "hashComments": "This work is licensed under the Open Data Commons Public Domain Dedication and Licence 1.0. To view a copy of this license, visit http://www.opendefinition.org/licenses/odc-pddl. This means the data is freely accessible to anyone to be used and shared as they wish with no restrictions."
        },
        {
            "#": "Title:",
            "hashComments": "Northern Rock Foundation 360Giving data"
        },
        {
            "#": "Standard:",
            "hashComments": "360Giving Standard"
        },
        {
            "#": "Schema:",
            "hashComments": "http://standard.threesixtygiving.org/en/latest/_static/360-giving-schema.json"
        },
        {
            "#": "Contact:",
            "hashComments": "For queries about this data contact support@threesixtygiving.org"
        },
        {
            "#": "Period:",
            "hashComments": "10/03/1998 to 30/06/2014"
        },
        {
            "#": "Description:",
            "hashComments": "Grants awarded between 1998 and June 2016. Northern Rock Foundation grants data is hosted by 360Giving on behalf of the foundation. Northern Rock Foundation formally closed on 25 April 2016."
        }

BobHarper1 commented 6 years ago

Sorry, closed by mistake! I meant to continue typing... so the hashComments aren't being ignored. I'll have a look at how that can be fixed.
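One possible stopgap, pending a proper fix, would be to drop those metadata rows from the converted JSON before validating. A sketch, based on the shape of the objects quoted above:

    def strip_meta_rows(grants):
        """Drop sheet-metadata rows: objects whose only keys are '#' and 'hashComments'."""
        return [g for g in grants if not set(g) <= {"#", "hashComments"}]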

Bjwebb commented 6 years ago

Ah, this will be due to datagetter's flatten-tool requirement being out of date: requirements.in, requirements.txt
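Assuming datagetter manages its dependencies with pip-tools (suggested by the requirements.in/requirements.txt pair), the fix would be a one-line bump. The version number here is illustrative, not the actual pin:

    # requirements.in -- raise the flatten-tool pin (version illustrative)
    flattentool>=0.11

    # then regenerate the lock file:
    # $ pip-compile requirements.in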

BobHarper1 commented 6 years ago

@Bjwebb Ah! OK, that saves me fiddling around with the file. Thanks!

robredpath commented 6 years ago

@Bjwebb is the solution here to just bring datagetter forward to the latest version of flatten-tool? I'm happy to do that and test.

Bjwebb commented 6 years ago

@robredpath That's right, thanks.

robredpath commented 6 years ago

I've opened a PR for that. Although, from a conversation with @Bjwebb last week, I don't think that's everything needed to be able to work with Google Sheets.

Bjwebb commented 6 years ago

AFAIK, Google Sheets should work. My impression is that the problem above is due to the hashComments, not the fact that the file is on Google Sheets.

KDuerden commented 6 years ago

The run last night resulted in 28 validation fails. Spot checks show they aren't failing CoVE. As the GN load is tomorrow, can this test be re-run today so I can see the real situation? Thank you!

robredpath commented 6 years ago

Hi @KDuerden - sorry about that, we introduced a bug with a recent update. I've backed out the changes and set the tests running again - I'll let you know when it's updated.

robredpath commented 6 years ago

@KDuerden we've re-run the report and the latest one should be accurate!

robredpath commented 5 years ago

This is, I believe, now fixed.