quality_data: Rework the aggregated to publisher value for grants indvi

michaelwood commented 1 year ago

Count the number of instances of the issue and make sure we use the total number of grants to orgs/individuals depending on the quality test.
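
One way to read this, as a minimal sketch only: each quality test is tagged with the kind of grant it applies to, and the number of instances of the issue is divided by the matching total. The test names, the `GrantTotals` type and the `APPLIES_TO` mapping are all hypothetical and are not taken from the datastore code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GrantTotals:
    to_organisations: int
    to_individuals: int


# Hypothetical mapping of quality test -> kind of grant it applies to.
APPLIES_TO = {
    "org_only_check": "organisations",
    "individual_only_check": "individuals",
    "all_grants_check": "all",
}


def fail_percentage(test: str, instances: int, totals: GrantTotals) -> Optional[float]:
    """Percentage of applicable grants with the issue, or None if the test does not apply."""
    kind = APPLIES_TO[test]
    if kind == "organisations":
        denominator = totals.to_organisations
    elif kind == "individuals":
        denominator = totals.to_individuals
    else:
        denominator = totals.to_organisations + totals.to_individuals
    if denominator == 0:
        return None  # no grants of this kind, so the test is not applicable
    return 100 * instances / denominator
```

Returning `None` rather than 0 keeps "not applicable" distinct from "no failures", which matters once the values are aggregated.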

michaelwood commented 1 year ago

Coverage fail not a problem in this instance

michaelwood commented 1 year ago

@KDuerden - "a heads up" on this one:

This work has had an effect on the way the publisher overview stats are calculated. The new values are slightly different [1] and are more consistent with the approach we take for the per-publisher quality stats.

On a publisher page, e.g. https://qualitydashboard.threesixtygiving.org/publisher/360G-10GM, we say a publisher gets a badge, e.g. "includes programme names", if at least one of their datasets contains at least one programme name.

In the old code we then totalled up the number of errors for each feature (like "50 of your grants don't have a programme name") per publisher and divided this by the total number of grants for that publisher. These counts were then aggregated to the different levels: per data file, per publisher, all publishers, all grants. However, this logic doesn't work once we say that "for some grants in some datasets" the badge might not be applicable, or only applicable to a subset of the data (or at least, it becomes complicated and error-prone quite quickly, as I found out).

In the new code I just count the number of badges that each publisher has, using the new data-context-aware code (i.e. whether or not the badge is applicable to the source file).
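
As a rough illustration of the difference between the two approaches, here is a sketch on made-up data. It is one possible reading of the description above, not the datastore implementation: the dictionary layout, the function names and the `applicable` flag are hypothetical, and leaving non-applicable publishers out of the denominator is an assumption.

```python
def old_style_percentage(publishers):
    """Average, over all publishers, of the share of grants that have the feature."""
    shares = [(p["grants"] - p["missing"]) / p["grants"] for p in publishers.values()]
    return round(100 * sum(shares) / len(shares))


def badge_percentage(publishers):
    """Share of publishers holding the badge, among those the badge applies to."""
    # Assumption: publishers the badge does not apply to are excluded entirely.
    applicable = [p for p in publishers.values() if p["applicable"]]
    with_badge = [p for p in applicable if p["missing"] < p["grants"]]
    return round(100 * len(with_badge) / len(applicable))


# Made-up example data: D only has grants the check does not apply to.
publishers = {
    "A": {"grants": 500, "missing": 40, "applicable": True},
    "B": {"grants": 100, "missing": 100, "applicable": True},
    "C": {"grants": 200, "missing": 0, "applicable": True},
    "D": {"grants": 50, "missing": 50, "applicable": False},
}

print(old_style_percentage(publishers))  # error-ratio approach
print(badge_percentage(publishers))      # badge-count approach
```

With this data the error-ratio approach gives 48 and the badge-count approach gives 67, because publisher D drags the first figure down even though the check does not apply to it.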

There are still some issues to work through on these stats and we have a planio ticket open to do a deeper dive into the different methods of calculating them which we'll need to come back to.

[1]

    "quality": {
-       "hasBeneficiaryLocationName": 67,
-       "hasGrantDuration": 56,
-       "hasGrantProgrammeTitle": 71,
-       "hasGrantClassification": 10,
-       "hasBeneficiaryLocationGeoCode": 38,
-       "hasRecipientOrgLocations": 67,
-       "hasRecipientOrgCompanyOrCharityNumber": 93,
-       "has50pcExternalOrgId": 94,
+       "hasBeneficiaryLocationName": 61,
+       "hasGrantDuration": 49,
+       "hasGrantProgrammeTitle": 66,
+       "hasGrantClassification": 7,
+       "hasBeneficiaryLocationGeoCode": 32,
+       "hasRecipientOrgLocations": 62,
+       "hasRecipientOrgCompanyOrCharityNumber": 89,
+       "has50pcExternalOrgId": 89,
        "hasRecipientIndividualsCodelists": 100
    }

KDuerden commented 1 year ago

@michaelwood thanks for the explanation.

To check I understand, your update now gives each file a badge based on whether it contains any data under the field(s) in question?

As some funders have multiple files, how is the overall badge decided? Is any badge for any file enough to give a badge overall?

michaelwood commented 1 year ago

The logic that decides whether each publisher gets a badge on Publishers https://qualitydashboard.threesixtygiving.org/publishers hasn't changed.
- [1] That works by running each file through the dataquality tool's usefulness checks. If the result isn't a 100% fail (i.e. fewer than 100% of the grants across all of a publisher's files are missing a programme name), the publisher is awarded the "has programme name" badge.
The logic that shows the summary numbers for all publishers on All data https://qualitydashboard.threesixtygiving.org/alldata#publishers has changed.
- This now works by aggregating the publisher pass/fails from [1], instead of trying to total up the number of errors (e.g. 40 of 500 grants for publisher A do/don't have a programme name; sum these up and divide by the total number of publishers).

This brings the two approaches on each page into line with each other and hopefully will make it simpler to maintain and change/update in the future.
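
For the multi-file question, the badge rule described in [1] could be sketched like this. The function name and the input shape are hypothetical, and treating a publisher with no grants as having no badge is an assumption.

```python
def publisher_has_badge(files):
    """Decide a badge from per-file counts.

    files: list of (total_grants, grants_missing_feature) pairs, one per file.
    """
    total = sum(grants for grants, _ in files)
    missing = sum(miss for _, miss in files)
    # The badge is awarded unless every grant across all files fails the check.
    # Assumption: a publisher with no grants at all gets no badge.
    return total > 0 and missing < total
```

Under this reading, a single grant with the feature in any one file is enough for the publisher to get the badge overall.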

KDuerden commented 1 year ago

Thank you.

So there is still the question of why these stats differ from the internal dashboard stats (which appear to be based on publishers having >0 of a feature in their data).

For reference the stats from that (which are taken from the coverage files iirc) are as follows:

"hasBeneficiaryLocationName": 66, "hasGrantDuration": 62, "hasGrantProgrammeTitle": 70, "hasGrantClassification": 9, "hasBeneficiaryLocationGeoCode": 38, "hasRecipientOrgLocations": 67, "hasRecipientOrgCompanyOrCharityNumber": 93, "has50pcExternalOrgId": 91,

michaelwood commented 1 year ago

Yeah, we've identified a few rules around the quality metrics that are slightly different compared to the internal dash. We have an upcoming mini project to look at some of these again and decide which we want to change.

ThreeSixtyGiving / datastore

quality_data: Rework the aggregated to publisher value for grants indvi #154