Open Bjwebb opened 6 years ago
I would like to add (from discussion with @kindly & @Bjwebb):

By this I mean the case where a single grant might have multiple instances of a field, for example multiple locations. If we only had a count of the instances of the fields, we might miss the fact that some grants are more densely populated than others.

e.g. a dataset with 1,000 grants could have 1,000 locations, but all of them in a single grant.
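To make that concrete, here is a small illustrative sketch (not the datagetter implementation; the field names and data structure are simplified) of the difference between counting field occurrences and counting grants that carry the field:

```python
# Illustrative only: why a bare occurrence count can be misleading.
def coverage_counts(grants, field):
    """Count total occurrences of `field` and the number of grants
    that contain it at least once."""
    total_fields = 0
    grants_with_field = 0
    for grant in grants:
        values = grant.get(field, [])
        if not isinstance(values, list):
            values = [values]
        if values:
            grants_with_field += 1
        total_fields += len(values)
    return {"total_fields": total_fields, "grants_with_field": grants_with_field}

# 1,000 grants; one grant carries 1,000 locations, the rest carry none.
grants = [{"beneficiaryLocation": [{"name": f"Loc {i}"} for i in range(1000)]}]
grants += [{} for _ in range(999)]

print(coverage_counts(grants, "beneficiaryLocation"))
# An occurrence count alone (1000) would look like full coverage,
# but only 1 grant out of 1,000 actually has the field.
```

This is why the output below reports both `total_fields` and `grants_with_field`.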
There's now a pull request at https://github.com/ThreeSixtyGiving/datagetter/pull/22
What's done:
What's not done:
Here's the coverage.json generated by that branch https://storage.googleapis.com/datagetter-360giving-output/branch/11-coverage/coverage.json
Here's a snippet of what the coverage data looks like:
"datagetter_coverage": {
"/grants": {
"grants_with_field": null,
"standard": true,
"total_fields": 1
},
"/grants/Grant Application Scheme": {
"grants_with_field": 66,
"standard": false,
"total_fields": 66
},
"/grants/amountAwarded": {
"grants_with_field": 228,
"standard": true,
"total_fields": 228
},
"/grants/awardDate": {
"grants_with_field": 228,
"standard": true,
"total_fields": 228
},
"/grants/beneficiaryLocation": {
"grants_with_field": 228,
"standard": true,
"total_fields": 228
},
"/grants/beneficiaryLocation/countryCode": {
"grants_with_field": 210,
"standard": true,
"total_fields": 221
},
"/grants/beneficiaryLocation/name": {
"grants_with_field": 228,
"standard": true,
"total_fields": 239
},
@stevieflow @drkane How does this look?
Looks great.
Would it always be a separate file, or would it populate the `status.json` file? We would presumably then end up with a three-tiered approach: `data.json` has the main registry in DCAT format, `status.json` has the additional `datagetter_metadata` and `datagetter_aggregates`, and then `coverage.json` also includes `datagetter_coverage`.
Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the `Recipient Org:Name` format. It would be helpful to have the original field name, perhaps as a `name` key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment, though.

Just looked at `coverage.py`; looks like it's probably not possible.
> We would presumably then end up with a three-tiered approach: `data.json` has the main registry in DCAT format, `status.json` has the additional `datagetter_metadata` and `datagetter_aggregates`, and then `coverage.json` also includes `datagetter_coverage`.

I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?
@drkane wrote:

> Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the `Recipient Org:Name` format. It would be helpful to have the original field name, perhaps as a `name` key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment though. Just looked at `coverage.py`, looks like it's probably not possible.
Yeah this would be reasonably involved to do. Options are to reconstruct the title by looking in the schema, or to get flatten-tool to produce its "heading source map" (this is how CoVE knows the real name of the column for validation errors).
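For reference, the schema-lookup option might look roughly like this. This is a hedged sketch: the schema fragment below is a hand-written stand-in for the real 360Giving JSON schema, and a full traversal would also need to handle `$ref`s, definitions, and other cases.

```python
# Hypothetical schema fragment standing in for the 360Giving schema.
SCHEMA = {
    "properties": {
        "beneficiaryLocation": {
            "title": "Beneficiary Location",
            "items": {
                "properties": {
                    "name": {"title": "Name"},
                    "countryCode": {"title": "Country Code"},
                }
            },
        },
        "recipientOrganization": {
            "title": "Recipient Org",
            "items": {"properties": {"name": {"title": "Name"}}},
        },
    }
}

def path_to_heading(path, schema=SCHEMA):
    """Turn a coverage path like '/grants/recipientOrganization/name'
    into a spreadsheet-style heading like 'Recipient Org:Name'.
    Returns None for paths that are not in the schema."""
    parts = [p for p in path.split("/") if p and p != "grants"]
    titles, node = [], schema
    for part in parts:
        node = node.get("properties", {}).get(part)
        if node is None:
            return None  # non-standard field, no schema title
        titles.append(node.get("title", part))
        node = node.get("items", node)  # descend into array items if present
    return ":".join(titles)

print(path_to_heading("/grants/recipientOrganization/name"))  # Recipient Org:Name
```

The flatten-tool "heading source map" option would recover the column headings actually used in the source spreadsheet rather than reconstructing canonical titles like this.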
> We would presumably then end up with a three-tiered approach: `data.json` has the main registry in DCAT format, `status.json` has the additional `datagetter_metadata` and `datagetter_aggregates`, and then `coverage.json` also includes `datagetter_coverage`.
>
> I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?
@drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in `status.json` small enough that you can read through the file more easily.
> @drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in `status.json` small enough that you can read through the file more easily.
For me personally, I'd be using the most detailed version of the data, which is currently `coverage.json`. Not sure if file size will become a problem - I was thinking aloud when we were talking about it, and actually the file size doesn't get that big.

Where it might make sense to have 3 separate files is that if there is a `data.json` > `status.json` > `coverage.json` separation, you can tell what processes have been run on each file, and so what state you can expect it to be in. If it overwrites the file, you wouldn't be able to tell what `status.json` contains at a glance, depending on which script has been run.
Bit of a newcomer to this issue, but the work referenced in the related pull request has now been merged, so I'm closing this issue. If this is in error, please feel free to re-open.
Great that this has now been merged @michaelwood.

One issue I've noticed since it went live is that `coverage.json` doesn't build on the exact same data as `status.json`, so there are inconsistencies between them.

The example I found was in the large National Lottery Community Fund CSV file, so presumably it's related to the caching that happens to avoid processing that file every time. The `datagetter_metadata` section is available in `status.json` but not in `coverage.json`. Presumably the cached data is brought in at a point after `coverage.json` is produced.

(It would also be good if a cached version of `datagetter_coverage` was available for this file.)
Looks like equivalent code for `coverage.json` would need to be added to https://github.com/ThreeSixtyGiving/datatester/blob/master/big_files_hardcoded/insert.py
I think this is an ordering issue specific to the way Travis is used to run the test. At the moment it is set up like this:

```yaml
script:
  # Don't convert big file because Travis doesn't have the RAM
  - travis_wait 60 ./run.sh --no-convert-big-files
  - python big_files_hardcoded/insert.py
```

This means that all the data JSON documents created via the `run.sh` Python scripts will be without the National Lottery data in them, which, as you mentioned, gets inserted later.
As an experiment I'll see if we can get Travis to re-run `coverage.py` after `insert.py` to regenerate `coverage.json`.
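If that works, the Travis `script` section might end up looking something like the following sketch (this assumes `coverage.py` can simply be invoked standalone after the insert step, which may not hold):

```yaml
script:
  # Don't convert big file because Travis doesn't have the RAM
  - travis_wait 60 ./run.sh --no-convert-big-files
  # Insert the cached big-file data first...
  - python big_files_hardcoded/insert.py
  # ...then regenerate coverage.json so it includes the inserted data
  - python coverage.py
```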
Had to make a couple of modifications to coverage to make sure it doesn't trip up on failed downloads. The test branch seems to output as expected: https://storage.googleapis.com/datagetter-360giving-output/branch/mw/test_issue_11_coverage/coverage.json

Here is a diff between the master branch and my modifications with the ordering issue fixed: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30

Think that looks right?
I don't think the output is quite right - the big file is being skipped rather than included (here in the diff). I'm not entirely sure why - it's in the `status.json` file, and the code in `coverage.py` looks like it should just pass it through if the file's not found.
Might need to check the file exists here: https://github.com/ThreeSixtyGiving/datatester/blob/master/coverage.py#L35
@drkane yeah, that was my thought - if you look at the branch, that's what I added: https://github.com/ThreeSixtyGiving/datatester/blob/mw/test_issue_11_coverage/coverage.py#L36

I'll check in with @Bjwebb
Ah sorry, missed that. Maybe it's a case of doing `stats.append(dataset)` if the exception is triggered.
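A minimal sketch of that pass-through behaviour, assuming (hypothetically) a loop over the datasets from `status.json` that opens each converted JSON file. Names like `json_path` and `build_coverage` are illustrative, not the actual `coverage.py` code:

```python
# Hedged sketch, not the real coverage.py: keep a dataset's entry in the
# output even when its converted JSON file is missing (e.g. a skipped big
# file or a failed download), rather than dropping it or crashing.
import json

def build_coverage(datasets):
    stats = []
    for dataset in datasets:
        try:
            with open(dataset["json_path"]) as f:  # hypothetical key
                grants = json.load(f).get("grants", [])
        except (FileNotFoundError, KeyError):
            # File not converted/downloaded: pass the dataset through
            # unchanged so status.json and coverage.json stay consistent.
            stats.append(dataset)
            continue
        # Toy coverage stat; the real script records per-field counts.
        dataset["datagetter_coverage"] = {
            "/grants": {"total_fields": len(grants)}
        }
        stats.append(dataset)
    return stats
```

The key point is the `continue` after appending: the dataset still appears in the output, just without a `datagetter_coverage` section.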
I've run it locally instead of on Travis (which was being hit by some time-out issues yesterday) and I get a much larger result. Diff is here: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30 and the full file is here: https://gist.github.com/michaelwood/1390d9dad228730a072c95a8f3fcfe5a
As we now have a new server with the data on it, it might be better to move away from using Travis and use the new server with a cron job. That way we can avoid issues that might be introduced by the Travis environment, such as memory, disk space and execution-time restrictions.
@michaelwood given that the server has been set up without using Salt, we're going to have to rebuild it at least once before we can use it for production infrastructure. The daily data test is used by 360Giving Insights (at least) so we need to make sure that we can rebuild the server that provides it quickly and easily if it ever goes down, and we can only really have confidence in that process if we've done it! So, while using the new server for the DDT in the future (as it has loads more resources than Travis!) is probably the right thing to do, I don't want us to end up in a state where we can't do the Salt development because we're using the server for production services!
Suggested by @drkane in ThreeSixtyGiving/test_registry#5