As part of the data quality measures for this project, we have developed SQL code that performs row counts and other checks to verify that each model contains the correct number of records and that the values in those models meet our data quality standards. While a number of tests are implemented in the yml files, we have also developed SQL worksheets in Snowflake that perform checks not covered by the yml tests. Worksheets currently exist for the intermediate diagnostic, clearinghouse, imputation, and performance schemas.
The results of these data quality checks can be used to implement additional tests in the yml files and to validate the results against other data sources. Below are the tests that need to be implemented and verified; each will be tracked as its own issue with additional details as needed:
[ ] Diagnostic models should contain the correct number of rows associated with active detectors and their associated stations on a daily basis (@kengodleskidot is the lead) #398
[x] Clearinghouse models should contain the correct number of rows per detector on a daily basis with observed data - (@kengodleskidot is the lead) #397
[x] Imputation models should contain the correct number of rows per detector on a daily basis and fill in all data holes with either observed or imputed data (@mmmiah is the lead) #404
[ ] Performance models should contain the correct number of rows per detector on a daily basis and should contain no data holes for the performance metric values VMT, VHT, Q, TTI, Delay and Productivity (@kengodleskidot is the lead) #465
[x] @kengodleskidot and @pingpingxiu-DOT-ca-gov will develop data quality checks to compare the modernized PeMS data sets against the existing PeMS data sets. This is being handled by issue #413, so it is crossed off this list (see that issue for additional details)
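The per-detector daily row-count checks above can be sketched with a simple aggregation query. The snippet below is a minimal illustration using SQLite; the table and column names (`detector_counts`, `detector_id`, `sample_date`) are placeholders rather than the project's actual model names, and the 288-rows-per-day figure assumes 5-minute data, which may not match the actual sampling interval used:

```python
import sqlite3

# Assumption: 5-minute data implies 288 expected rows per detector per day.
EXPECTED_ROWS_PER_DAY = 288

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE detector_counts (detector_id TEXT, sample_date TEXT)"
)
# Seed one complete day and one day with a data hole (8 missing rows).
rows = [("D100", "2024-01-01")] * 288 + [("D100", "2024-01-02")] * 280
conn.executemany("INSERT INTO detector_counts VALUES (?, ?)", rows)

# Flag detector-days whose row count deviates from the expected total.
query = """
    SELECT detector_id, sample_date, COUNT(*) AS n_rows
    FROM detector_counts
    GROUP BY detector_id, sample_date
    HAVING COUNT(*) <> ?
"""
gaps = conn.execute(query, (EXPECTED_ROWS_PER_DAY,)).fetchall()
print(gaps)  # -> [('D100', '2024-01-02', 280)]
```

A query of the same shape, run in a Snowflake worksheet against the real models, would surface any detector-day whose observed-plus-imputed row count falls short of the expected total.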