Closed by summer-mothwood 2 weeks ago
Thanks @summer-mothwood. Based on your findings, I propose:
In Kibana, compare recent Data Relay counts with Clearinghouse counts on an hourly basis. Now that we have ES Kibana, we can pull both the DB96 and Snowflake tables into Kibana and compare them side by side. This would address both #454 and #423, which you created.
If the latest comparison reveals gaps, we can troubleshoot accordingly.
Past data holes need patching, to a meaningful extent.
Snowflake can retrieve the list of windows that Data Relay needs to re-crawl. (We need to fix the idempotency issue on Data Relay first.)
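To make the hourly comparison and the re-crawl list concrete, here is a minimal sketch in plain Python. It assumes the observation timestamps from each source have already been pulled into memory; `hourly_counts` and `recrawl_windows` are hypothetical helper names for illustration, not existing Data Relay or Snowflake code.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count observations per hour bucket (timestamps truncated to the hour)."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def recrawl_windows(relay_counts, clearinghouse_counts):
    """Return the sorted hours where Data Relay has fewer observations
    than the Clearinghouse benchmark (assumed here to be the reference)."""
    return sorted(
        hour
        for hour, expected in clearinghouse_counts.items()
        if relay_counts.get(hour, 0) < expected
    )


# Toy data: Data Relay is missing all of hour 01 and one row in hour 02.
relay = hourly_counts([
    datetime(2024, 1, 1, 0, 5),
    datetime(2024, 1, 1, 2, 10),
])
clearinghouse = hourly_counts([
    datetime(2024, 1, 1, 0, 5),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 2, 10),
    datetime(2024, 1, 1, 2, 40),
])

for hour in recrawl_windows(relay, clearinghouse):
    print(hour.isoformat())
```

In practice the hourly aggregation would more likely run as a query in Snowflake, with only the resulting window list handed to Data Relay. Re-crawling a window is only safe once the idempotency issue is fixed, since otherwise a re-crawl could duplicate rows.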
@ZhenyuZhu-Caltrans @ian-r-rose
@pingpingxiu-DOT-ca-gov I agree, I think pulling both the DB96 and Snowflake tables into Kibana to compare would be a great way to monitor for data issues for now. But for #454 we also want to create checks that would identify missing db96 data in Snowflake without needing to rely on the clearinghouse for comparison.
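One possible shape for a clearinghouse-independent check is to compare each station's hourly row count against the count implied by the reporting cadence. The sketch below assumes a fixed 30-second cadence (so 120 rows per station per hour); that interval, the `(station_id, timestamp)` row shape, and the name `find_incomplete_hours` are all assumptions for illustration and should be replaced with whatever the db96 tables actually use.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumption: each station reports once every 30 seconds.
SAMPLE_INTERVAL_SECONDS = 30
EXPECTED_PER_HOUR = 3600 // SAMPLE_INTERVAL_SECONDS


def find_incomplete_hours(rows, start, end, expected=EXPECTED_PER_HOUR):
    """Flag station-hours with fewer rows than expected.

    rows:  iterable of (station_id, timestamp) pairs.
    start: first hour to check (must fall on an hour boundary).
    end:   exclusive upper bound of the range to check.
    Returns (station_id, hour, row_count) tuples ordered by hour, then station.
    """
    counts = Counter(
        (sid, ts.replace(minute=0, second=0, microsecond=0)) for sid, ts in rows
    )
    stations = sorted({sid for sid, _ in counts})
    gaps = []
    hour = start
    while hour < end:
        for sid in stations:
            n = counts.get((sid, hour), 0)  # hours with no rows count as 0
            if n < expected:
                gaps.append((sid, hour, n))
        hour += timedelta(hours=1)
    return gaps
```

Note that this only checks stations that appear at least once in `rows`. A station that is missing entirely would need an external station list to be detected.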
The conversation and work for this ticket have been happening in #423, so in the interest of clarity, we're closing this ticket and re-opening #423.
As noted in #423, the data relay server seems to be missing data across all districts (when using the clearinghouse data as a benchmark). The full analysis can be found in this Snowflake notebook: https://app.snowflake.com/vsb79059/dse_caltrans_pems/#/notebooks/TRANSFORM_DEV.PUBLIC.DATA_RELAY_UNION_TEST_STATIONS_SAMPLE
This ticket is to investigate:
This code can be used to compare the number of observations between the clearinghouse and the data relay server, which is a good benchmark for now, but we'll eventually want to be able to detect missing data in the data relay server without relying on clearinghouse as a source of truth: