Closed rossjones closed 7 years ago
The central government spend data guidance was produced by HM Treasury in 2010: https://www.gov.uk/government/publications/guidance-for-publishing-spend-over-25000
HMT describe the schema pretty accurately in prose and with a spreadsheet example. There is a JSON schema here: https://github.com/datagovuk/schemas/blob/master/spend-hmt/spend-25k.json
I've also put together a quick guide for publishers here: http://guidance.data.gov.uk/25k-spend-data.html although it's not been sent out methodically to publishers. But perhaps you would like to do a PR with a bit about how they should check their CSV complies with a schema in goodtables/csvlint.io? And clearly there has long been an opportunity to tie these into data.gov.uk and/or gov.uk.
Thanks for the pointers, I'll send the suggested PR. Disappointing to see that HMRC aren't following their own guidance, specifically for the Amount column over the last three months at https://data.gov.uk/dataset/financial-transactions-data-hmrc - perhaps it is a bigger problem than I thought.
I don't know where the HMRC references are coming from, or understand why they would be involved in standards.
You're right, getting my HM(*) mixed up, but HMRC were mentioned and I don't know if it was a mistake when HMT was meant, or whether they also have a schema.
The LGA page with the spend CSV schema, guidance and other resources is at http://schemas.opendata.esd.org.uk/Spend
I believe SpendNetwork was consulted along with lots of councils. I'll ask them for comment.
@MikeThacker1 is there a reason why the ESD schema is so different from the HMT one? I understand a slightly different purpose, but am just trying to get my head around what might be missing from each.
The ESD / LGA one was designed to meet the requirements of the "Local government transparency code" in 2014, updated in 2015. See https://www.gov.uk/government/publications/local-government-transparency-code-2015
DCLG did not want to be too prescriptive in how LAs should record things, so it allowed flexibility in how each requirement is met (it would have been easier if there had been less flexibility). There are also some differences in how local government operates.
That said, I'm not sure how much the HMT one was referenced. I've asked LGA people if they might chip in.
Five years ago, I found an HMT guidance document, the same one @davidread linked to above. By mistake I referred to it as HMRC guidance yesterday.
In 2011, I wrote a spending data CSV validator, which I ran over spending data files found on data.gov.uk via this query: http://data.gov.uk/search/apachesolr_search/spend%20over?filters=type:ckan_package
The validator checked whether files were valid CSV, and whether mandatory headers specified in the Guidance were in the first row. It produced a page reporting which files conformed and which had problems.
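For anyone wanting to try something similar, here is a minimal sketch of that kind of check in Python. It is not the 2011 validator itself. The `MANDATORY_HEADERS` list is an illustrative assumption and should be replaced with the exact column names from the HMT guidance; the function name and the good/partial/bad labels are also hypothetical.

```python
import csv
import io

# Assumption: an illustrative subset of header names, NOT the authoritative
# list. Take the real mandatory headers from the HMT guidance.
MANDATORY_HEADERS = ["Department family", "Entity", "Date", "Supplier", "Amount"]


def classify_spend_csv(text, mandatory=MANDATORY_HEADERS):
    """Classify CSV text as 'good', 'partial' or 'bad' (hypothetical labels).

    good    - file parses as CSV and every mandatory header is in the first row
    partial - file parses and some mandatory headers are in the first row
    bad     - file does not parse, is empty, or has none of the headers
    """
    try:
        # strict=True makes the csv module raise csv.Error on malformed
        # quoting instead of silently accepting it.
        rows = list(csv.reader(io.StringIO(text, newline=""), strict=True))
    except csv.Error:
        return "bad"
    if not rows or not rows[0]:
        return "bad"

    # Compare headers case-insensitively, ignoring surrounding whitespace.
    found = {cell.strip().lower() for cell in rows[0]}
    wanted = {header.lower() for header in mandatory}
    present = wanted & found

    if present == wanted:
        return "good"
    if present:
        return "partial"
    return "bad"
```

Matching headers case-insensitively is a design choice made here for leniency; a stricter check could require an exact match and a fixed column order.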
At the time my idea was that some part of government e.g. the National Audit Office, could run the validator and notify publishers when their files do not conform to the HMT mandated format.
I love your idea of seeing these things as an 'audit', the same way as any legit company provides accounts in standard format and is audited. I've no idea if the NAO could be interested in this.
However, I'm also keen that checking is done at the earliest opportunity, so that the feedback loop is as strong as possible. Once the check is separated from the act of publishing, in time or in place on the web, it's not as powerful. The obvious place to do schema checking is when the data file is added to the central publishing infrastructure, which means gov.uk Publications and/or data.gov.uk.
Ideally we should help organisations validate their data prior to publishing, and consider blocking publication when validation checks fail.
Thanks to the Internet Archive, you can see the UK Spending Data CSV validation report I generated in 2011. For the files analysed the validation report breakdown was:
Good Data - 36%
- All mandatory headers in first row - 1,132 files (36%)
Partial Data - 19%
- Some mandatory headers in first row - 605 files (19%)
Bad Data - 45%
- No standard headers in first row - 795 files (25%)
- Errors parsing file as CSV - 507 files (16%)
- File not found - 112 files (4%)
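As a quick sanity check on those figures, the percentages can be recomputed from the file counts. This is only a sketch: the counts are copied from the report above, the dictionary keys are shortened labels of my own, and the total is derived by summing the counts rather than taken from the report.

```python
# File counts as quoted in the 2011 validation report breakdown.
counts = {
    "all mandatory headers": 1132,
    "some mandatory headers": 605,
    "no standard headers": 795,
    "CSV parse errors": 507,
    "file not found": 112,
}

# Derived total (an assumption: the report's own total is not quoted here).
total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

# "Bad Data" is the sum of the last three categories.
bad_files = (
    counts["no standard headers"]
    + counts["CSV parse errors"]
    + counts["file not found"]
)
bad_share = round(100 * bad_files / total)

for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"bad data overall: {bad_share}%")
```

The rounded shares agree with the 36%, 19% and 45% headline figures.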
@robmckinnon I've added a ticket at https://github.com/datagovuk/ckanext-dgu/issues/416 about discussing the feasibility of adding this to DGU for when people add the metadata. Unfortunately it won't solve the problem as they will already have uploaded the actual content to gov.uk, but if we have some code to share, then that might help encourage its use.
It looks like there are schemas (schemata? schemae?) available which have been published as official guidance. Therefore I'm closing this issue.
If you think the Standards team should take another look at this, please let me know.
Recently the topic of spend data has come up, particularly how everyone seems to publish their data in slightly different structures, sometimes with the same organisation using a different structure each month.
I was aware of the Local Government Association's schema ( https://github.com/esd-org-uk/schemas/blob/master/Spend/Spend.json ) and an HMRC schema was mentioned. I haven't found the HMRC schema yet, although I presume it is specific to core departments, so I'd welcome any pointers.
There's likely to be a problem persuading people to use a specific schema, but we should at least have a schema to suggest and to that end, I'm hoping to gather opinions on the best approach for this. Should we be asking people like https://www.spendnetwork.com/ for guidance on what they would expect? Could @torgo arrange this if so, as I believe they are at ODI.
This might be of interest to @davidread as he's had some experience with the https://openspending.org/ codebase.