ThreeSixtyGiving / datatester

Scripts to assess the quality of data from http://data.threesixtygiving.org
MIT License

Add coverage data to status.json #11

Open Bjwebb opened 6 years ago

Bjwebb commented 6 years ago

Suggested by @drkane in ThreeSixtyGiving/test_registry#5

stevieflow commented 6 years ago

I would like to add (from discussion with @kindly & @Bjwebb) -

By this I refer to the case where a single grant might have multiple instances of a field, such as multiple locations. If we only had a count of the instances of each field, we might miss the fact that some grants are much more heavily populated than others.

eg: a dataset with 1000 grants could have 1000 locations, but all of them in a single grant
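This distinction can be sketched in Python (a hypothetical illustration, not the actual datatester code; the path and key names are assumptions):

```python
# Count two things per field: "total_fields" (every occurrence) and
# "grants_with_field" (grants containing it at least once), so a field
# concentrated in a single grant stands out.

def field_coverage(grants):
    totals = {}
    grants_with = {}
    for grant in grants:
        seen = set()
        for location in grant.get("beneficiaryLocation", []):
            for field in location:
                path = "/grants/beneficiaryLocation/" + field
                totals[path] = totals.get(path, 0) + 1
                seen.add(path)
        for path in seen:
            grants_with[path] = grants_with.get(path, 0) + 1
    return {p: {"total_fields": totals[p],
                "grants_with_field": grants_with[p]}
            for p in totals}

# 1000 grants: one grant holds all 1000 locations, the rest have none.
grants = [{"beneficiaryLocation": [{"name": "X"}] * 1000}] + [{}] * 999
cov = field_coverage(grants)
# total_fields is 1000 here, but grants_with_field is only 1
```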

Bjwebb commented 6 years ago

There's now a pull request at https://github.com/ThreeSixtyGiving/datagetter/pull/22

What's done:

What's not done:

Here's the coverage.json generated by that branch https://storage.googleapis.com/datagetter-360giving-output/branch/11-coverage/coverage.json

Here's a snippet of what the coverage data looks like:

    "datagetter_coverage": {
      "/grants": {
        "grants_with_field": null,
        "standard": true,
        "total_fields": 1
      },
      "/grants/Grant Application Scheme": {
        "grants_with_field": 66,
        "standard": false,
        "total_fields": 66
      },
      "/grants/amountAwarded": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/awardDate": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/beneficiaryLocation": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 228
      },
      "/grants/beneficiaryLocation/countryCode": {
        "grants_with_field": 210,
        "standard": true,
        "total_fields": 221
      },
      "/grants/beneficiaryLocation/name": {
        "grants_with_field": 228,
        "standard": true,
        "total_fields": 239
      },
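
As a hypothetical illustration of consuming this data, a small reader could turn the entries above into per-field coverage percentages (key names are taken from the snippet; this is not part of datatester):

```python
# Report what share of grants carry each field, skipping entries
# (like the "/grants" root) where grants_with_field is null.

def coverage_report(datagetter_coverage, total_grants):
    report = {}
    for path, info in datagetter_coverage.items():
        n = info.get("grants_with_field")
        if n is None:
            continue
        report[path] = round(100 * n / total_grants, 1)
    return report

snippet = {
    "/grants": {"grants_with_field": None, "standard": True,
                "total_fields": 1},
    "/grants/awardDate": {"grants_with_field": 228, "standard": True,
                          "total_fields": 228},
    "/grants/beneficiaryLocation/countryCode": {"grants_with_field": 210,
                                                "standard": True,
                                                "total_fields": 221},
}
report = coverage_report(snippet, 228)
print(report)
# {'/grants/awardDate': 100.0, '/grants/beneficiaryLocation/countryCode': 92.1}
```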

@stevieflow @drkane How does this look?

drkane commented 6 years ago

Looks great.

Would it always be a separate file or would it populate the status.json file? We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.

Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the Recipient Org:Name format. It would be helpful to have the original field name, perhaps as a name key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment though.

drkane commented 6 years ago

Just looked at coverage.py, looks like it's probably not possible.

stevieflow commented 6 years ago

> We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.

I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?

Bjwebb commented 6 years ago

@drkane wrote:

> Something that could cause confusion is that the fields are identified using the proper schema representation, but in the file they are likely to be in the Recipient Org:Name format. It would be helpful to have the original field name, perhaps as a name key in the object for each field, to help map these fields to the data itself. Not sure if that's possible with how the coverage file is generated at the moment though.
>
> Just looked at coverage.py, looks like it's probably not possible.

Yeah this would be reasonably involved to do. Options are to reconstruct the title by looking in the schema, or to get flatten-tool to produce its "heading source map" (this is how CoVE knows the real name of the column for validation errors).
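The first option could be sketched like this (a hypothetical example against a simplified schema fragment; the real 360Giving schema layout and datatester code differ):

```python
# Recover a human-readable column heading by walking the JSON schema
# and joining "title" values with ":", descending into "items" for
# array-typed properties.

def title_for_path(schema, path):
    parts = path.strip("/").split("/")[1:]  # drop the leading "grants"
    titles = []
    node = schema
    for part in parts:
        node = node["properties"][part]
        titles.append(node.get("title", part))
        if node.get("type") == "array":
            node = node["items"]
    return ":".join(titles)

# Simplified schema fragment, for illustration only.
schema = {
    "properties": {
        "recipientOrganization": {
            "title": "Recipient Org", "type": "array",
            "items": {"properties": {
                "name": {"title": "Name", "type": "string"}}},
        }
    }
}
heading = title_for_path(schema, "/grants/recipientOrganization/name")
print(heading)  # Recipient Org:Name
```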

Bjwebb commented 6 years ago

> We would presumably then end up with a three-tiered approach: data.json has the main registry in DCAT format, status.json has the additional datagetter_metadata and datagetter_aggregates and then coverage.json also includes datagetter_coverage.
>
> I'd be interested to know if this is the direction we take, as I have made similar comments elsewhere. @drkane - as a User (!) - would you find this useful?

@drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in status.json small enough that you can read through the file more easily.

drkane commented 6 years ago

> @drkane Would be useful to know how well you think it works. The separation is partly based on what you said on a call about maybe keeping the data in status.json small enough that you can read through the file more easily.

For me personally, I'd be using the most detailed version of the data, which is currently coverage.json. Not sure if file size will become a problem - I was thinking aloud when we were talking about it, and actually the file size doesn't get that big.

Where it might make sense to have three separate files is that with a data.json > status.json > coverage.json separation you can tell which processes have been run to produce each file, and so what state to expect it to be in. If each step overwrote the same file, you wouldn't be able to tell at a glance what status.json contains, since that would depend on which script had been run.

michaelwood commented 5 years ago

Bit of a newcomer to this issue, but the work referenced in the related pull request has now been merged, so I'm closing this issue. If this is in error, please feel free to re-open.

drkane commented 5 years ago

Great that this has now been merged @michaelwood.

One issue I've noticed since it went live is that coverage.json doesn't build on the exact same data as status.json, so there are inconsistencies between them.

The example I found was in the large National Lottery Community Fund CSV file, so presumably it's related to the caching that happens to avoid processing that file every time. The datagetter_metadata section is available in status.json but not in coverage.json. Presumably the cached data is brought in at a point after coverage.json is produced.

(It would also be good if a cached version of datagetter_coverage was available for this file.)

drkane commented 5 years ago

Looks like equivalent code for coverage.json would need to be added to https://github.com/ThreeSixtyGiving/datatester/blob/master/big_files_hardcoded/insert.py

michaelwood commented 5 years ago

I think this is an ordering issue specific to the way Travis is used to run the test. At the moment it is set up like this:

    script:
      # Don't convert the big file because Travis doesn't have the RAM
      - travis_wait 60 ./run.sh --no-convert-big-files
      - python big_files_hardcoded/insert.py

This means that all the data JSON documents created via the run.sh Python scripts will be without the National Lottery data, which, as you mentioned, gets inserted later.

As an experiment I'll see if we can get Travis to re-run coverage.py after insert.py to regenerate coverage.json.
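The reordered config might look something like this (a sketch; the actual .travis.yml may differ):

```yaml
script:
  # Don't convert the big file because Travis doesn't have the RAM
  - travis_wait 60 ./run.sh --no-convert-big-files
  # Insert the pre-processed National Lottery data
  - python big_files_hardcoded/insert.py
  # Re-run coverage so coverage.json includes the inserted data
  - python coverage.py
```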

michaelwood commented 5 years ago

Had to make a couple of modifications to coverage.py to make sure it doesn't trip up on failed downloads. The test branch seems to output as expected: https://storage.googleapis.com/datagetter-360giving-output/branch/mw/test_issue_11_coverage/coverage.json. Here is a diff between the master branch and my modifications with the ordering issue fixed: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30. Think that looks right?

drkane commented 5 years ago

I don't think the output is quite right - the big file is being skipped rather than included (here in the diff). I'm not entirely sure why - it's in the status.json file and the code looks like it should just pass it through if it's not found in coverage.py.

drkane commented 5 years ago

Might need to check the file exists here: https://github.com/ThreeSixtyGiving/datatester/blob/master/coverage.py#L35

michaelwood commented 5 years ago

@drkane yeah that was my thought, if you see the branch that was what I added https://github.com/ThreeSixtyGiving/datatester/blob/mw/test_issue_11_coverage/coverage.py#L36

I'll check in with @Bjwebb

drkane commented 5 years ago

Ah sorry, missed that. Maybe it's a case of doing stats.append(dataset) if the exception is triggered.
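The guard being discussed could look roughly like this (a hypothetical sketch; the function, variable, and field names are assumptions, not the real coverage.py code):

```python
import json
import os
import tempfile

def add_coverage(stats, dataset, json_dir):
    """Attach coverage data if the dataset's JSON file exists;
    append the dataset to stats either way."""
    path = os.path.join(json_dir, dataset["identifier"] + ".json")
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        # Skipped big file: pass the dataset through unchanged
        # rather than dropping it from the output entirely.
        stats.append(dataset)
        return
    counts = {}
    for grant in data.get("grants", []):
        for field in grant:
            counts[field] = counts.get(field, 0) + 1
    dataset["datagetter_coverage"] = counts
    stats.append(dataset)

# Demo: one dataset whose JSON exists, one whose JSON was skipped.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "small.json"), "w") as f:
        json.dump({"grants": [{"id": "1", "title": "A"}, {"id": "2"}]}, f)
    stats = []
    add_coverage(stats, {"identifier": "small"}, d)
    add_coverage(stats, {"identifier": "big-skipped"}, d)
```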

michaelwood commented 5 years ago

I've run it locally instead of on Travis (which was being hit by some timeout issues yesterday) and I get a much larger result. The diff is here: https://gist.github.com/michaelwood/48697cd3bbfa7df892ee76f905dcde30 and the full file is here: https://gist.github.com/michaelwood/1390d9dad228730a072c95a8f3fcfe5a

As we now have a new server with the data on it, it might be better to move away from using Travis and use the new server with a cron job. That way we can get away from any issues that might be introduced by the Travis environment, such as memory, disk space and execution time restrictions.

robredpath commented 5 years ago

@michaelwood given that the server has been set up without using Salt, we're going to have to rebuild it at least once before we can use it for production infrastructure. The daily data test is used by 360Giving Insights (at least) so we need to make sure that we can rebuild the server that provides it quickly and easily if it ever goes down, and we can only really have confidence in that process if we've done it! So, while using the new server for the DDT in the future (as it has loads more resources than Travis!) is probably the right thing to do, I don't want us to end up in a state where we can't do the Salt development because we're using the server for production services!