cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License
7 stars 0 forks source link

Investigate missing data in data relay server #453

Closed summer-mothwood closed 2 weeks ago

summer-mothwood commented 3 weeks ago

As noted in #423 , the data relay server seems to have missing data in it across all districts (when using the clearinghouse data as a benchmark). The full analysis can be found in this Snowflake notebook: https://app.snowflake.com/vsb79059/dse_caltrans_pems/#/notebooks/TRANSFORM_DEV.PUBLIC.DATA_RELAY_UNION_TEST_STATIONS_SAMPLE

This ticket is to investigate:

This code can be used to compare the number of observations between the clearinghouse and the data relay server, which is a good benchmark for now, but we'll eventually want to be able to detect missing data in the data relay server without relying on clearinghouse as a source of truth:


with sample_ids as
    (select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '3') sample (331 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '4') sample (360 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '5') sample (250 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '6') sample (272 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '7') sample (359 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '8') sample (340 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '10') sample (310 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '11') sample (323 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '12') sample (344 rows)),

    clearinghouse as 
    (select sample_date, count(distinct sample_timestamp, id) as c, 'clearinghouse' as server
    from raw_prd.clearinghouse.station_raw
    where sample_date > '2024-04-23' and ID in (select id from sample_ids)
    group by  sample_date),

    data_relay as
    (select sample_date, count(distinct sample_time, vds_id) as c, 'data_relay' as server from raw_prd.db96.vds30sec
    where sample_date > '2024-04-23' and vds_ID in (select id from sample_ids)
    group by sample_date)

select sample_date, c, server from clearinghouse UNION select sample_date, c, server from data_relay
order by sample_date
pingpingxiu-DOT-ca-gov commented 3 weeks ago

Thanks @summer-mothwood . Based on your findings, I propose:

  1. On Kibana, compare the Data Relay counts with Clearhouse on a Hourly basis recently. Now we have ES Kibana, we can pull both DB96 and Snowflake tables to Kibana and comparing side by side. This would address both #454 and #423 you created.

    If the latest compare reveals gaps, we can troubleshoot accordingly.

  2. Past data holes need patching, to a meaningful extent.

    Snowflake can retrieve the list of windows that need to re-crawl by Data Relay. (We need to fix idempotency issue on Data Relay first.)

@ZhenyuZhu-Caltrans @ian-r-rose

summer-mothwood commented 3 weeks ago

@pingpingxiu-DOT-ca-gov I agree, I think pulling both SB96 and Snowflake tables to Kibana to compare would be a great way to monitor for data issues for now. But for #454 we also want to create checks that would identify missing db96 data in snowflake without needing to rely on clearinghouse for comparison.

summer-mothwood commented 2 weeks ago

The conversation / work for this ticket has been happening in #423, so in interest of clarity, we're closing this ticket and re-opening #423