The new Data Quality Score algorithm code and documentation appear to contain some errors. We should review the solution to ensure that all required fields are being scored and that optional fields carry less weight in point value.
The documentation shows the point value assigned to each required and optional field of Metadata Schema 2.0.0. The screenshot below shows the documented point assignment per field, alongside the source code list of fields that are evaluated on each repo.
The solution is missing the following fields:
agency
measurementType
releases
Keep in mind that "organization" is an optional field nested under "releases".
Also, the algorithm evaluates each repo on all required and all optional fields, not just the required fields. This means each repo is graded on a 158-point scale, not on the 71 points that the required fields total.
(Repo Total Points / 158 ) * 10 = repo score.
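To make the impact concrete, here is a minimal sketch of the formula above. It uses only the two totals quoted in this issue (158 and 71); the function name is hypothetical and this is not the harvester's actual code. It also assumes a repo that fills in every required field earns the full 71 points.

```python
# Totals quoted in this issue; per-field point values are not reproduced here.
REQUIRED_POINTS = 71   # sum of points for all required fields
TOTAL_POINTS = 158     # required + optional fields


def current_score(repo_points: float) -> float:
    """Hypothetical helper: (Repo Total Points / 158) * 10."""
    return repo_points / TOTAL_POINTS * 10


# A repo with every required field populated and no optional fields.
required_only = current_score(REQUIRED_POINTS)
print(round(required_only, 2))  # 4.49
```

Under that assumption, a repo that is fully compliant on required fields still scores below 5 out of 10.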
Perhaps consider scaling the optional fields as a bonus point value added to the overall score?
Or consider an alternative solution in which optional fields do not negatively impact the score for required fields. Not all agencies populate their code.json metadata file with optional fields on their releases/repos, and that lowers their overall Data Quality Score.
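One possible shape for the bonus idea, as a rough sketch only: required fields alone determine the 0 to 10 base score, and optional fields add a small bonus on top. The function name and the `MAX_BONUS` value are assumptions for illustration, not anything taken from the current implementation or documentation.

```python
REQUIRED_POINTS = 71                     # total quoted in this issue
OPTIONAL_POINTS = 158 - REQUIRED_POINTS  # remaining 87 points are optional
MAX_BONUS = 1.0                          # hypothetical cap on the optional bonus


def bonus_score(required_earned: float, optional_earned: float,
                max_bonus: float = MAX_BONUS) -> float:
    """Hypothetical scoring: base from required fields, bonus from optional."""
    base = required_earned / REQUIRED_POINTS * 10
    bonus = optional_earned / OPTIONAL_POINTS * max_bonus
    return base + bonus


print(bonus_score(71, 0))   # all required fields, no optional fields
print(bonus_score(71, 87))  # every required and optional field populated
```

With this approach a repo that populates every required field reaches the full base score of 10, and missing optional fields no longer pull it down. Whether the total should be allowed to exceed 10 or be capped is a separate design choice.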
Algorithm Documentation:
https://github.com/GSA/code-gov/blob/master/data_quality_scoring.md
Algorithm Rules:
https://github.com/GSA/code-gov-harvester/blob/master/libs/rules/index.js