The new Data Quality Score algorithm code and documentation appear to contain some potential errors. We should review the solution to ensure that all required fields are being scored and that optional fields are weighted lower in point value.

Algorithm Documentation:

https://github.com/GSA/code-gov/blob/master/data_quality_scoring.md

Algorithm Rules:

https://github.com/GSA/code-gov-harvester/blob/master/libs/rules/index.js

The documentation lists the point value assigned to each required and optional field in Metadata Schema 2.0.0. The screenshot below shows the documented point assignment per field alongside the list of fields in the source code that are evaluated on each repo.

[Screenshot, 2019-05-30: documented point values per field next to the fields evaluated in the source code]

The solution is missing the following fields:

  agency
  measurementType
  releases

Keep in mind that "organization" is an optional field nested under "releases".
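One way to catch gaps like this is to compare the schema's field list against the fields that have a scoring rule. The following is a minimal sketch, not the harvester's code: `findUnscoredFields` is a hypothetical helper, and both input arrays are illustrative placeholders rather than the real contents of the schema or of `libs/rules/index.js`.

```javascript
// Hypothetical helper: returns the schema fields that have no scoring rule.
function findUnscoredFields(schemaFields, scoredFields) {
  const scored = new Set(scoredFields);
  return schemaFields.filter((field) => !scored.has(field));
}

// Illustrative inputs only. The three missing fields come from this issue;
// "name" stands in for any field that does have a rule.
const schemaFields = ["agency", "measurementType", "releases", "name"];
const scoredFields = ["name"];

console.log(findUnscoredFields(schemaFields, scoredFields));
// [ 'agency', 'measurementType', 'releases' ]
```

Running a check like this in a test would flag any schema field added later without a matching rule.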

Also, the algorithm evaluates each repo on all required and all optional fields, not just the required fields. This means each repo is graded on a 158-point scale, not just on the 71 points available from required fields.

 (Repo Total Points / 158 ) * 10 = repo score.
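To make the effect concrete, here is a small sketch of the formula above. It assumes the 71 and 87 point totals quoted in this issue; `currentScore` is a hypothetical helper written for illustration, not a function from the harvester.

```javascript
// Point totals quoted in this issue, not read from the harvester source.
const REQUIRED_MAX = 71;
const OPTIONAL_MAX = 87;
const TOTAL_MAX = REQUIRED_MAX + OPTIONAL_MAX; // 158

// Hypothetical helper mirroring (Repo Total Points / 158) * 10.
function currentScore(requiredPoints, optionalPoints) {
  return ((requiredPoints + optionalPoints) / TOTAL_MAX) * 10;
}

// A repo that fills in every required field and no optional fields.
console.log(currentScore(71, 0).toFixed(2)); // 4.49
```

Under these assumptions, a repo that is fully compliant on required fields still scores below 5 out of 10.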

Perhaps consider scaling the optional fields as a bonus point value to the overall score?

For example:

 (Repo Required Fields Points / 71 ) * 10 + (Repo Optional Fields Points / 87)*10 = repo score.
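A sketch of this bonus formula, again assuming the 71 and 87 point totals from this issue. `bonusScore` is a hypothetical name used only for illustration.

```javascript
// Point totals quoted in this issue, not read from the harvester source.
const MAX_REQUIRED = 71;
const MAX_OPTIONAL = 87;

// Hypothetical helper: required fields set the base score,
// optional fields add a bonus on top of it.
function bonusScore(requiredPoints, optionalPoints) {
  const base = (requiredPoints / MAX_REQUIRED) * 10;
  const bonus = (optionalPoints / MAX_OPTIONAL) * 10;
  return base + bonus;
}

console.log(bonusScore(71, 0)); // 10
console.log(bonusScore(71, 87)); // 20
```

With the formula exactly as written, the ceiling becomes 20 rather than 10, so the bonus term would likely need a smaller multiplier if scores are meant to stay close to a 10-point scale.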

Or consider an alternative solution in which optional fields do not negatively impact the score earned on required fields. Not all agencies populate the optional fields in their code.json metadata file for their releases/repos, and that lowers their overall Data Quality Score.

GSA / code-gov-harvester

Data Quality Score Algorithm - Potential Issue #20
