Investigate missing data in data relay server

summer-mothwood commented 3 weeks ago

As noted in #423 , the data relay server seems to have missing data in it across all districts (when using the clearinghouse data as a benchmark). The full analysis can be found in this Snowflake notebook: https://app.snowflake.com/vsb79059/dse_caltrans_pems/#/notebooks/TRANSFORM_DEV.PUBLIC.DATA_RELAY_UNION_TEST_STATIONS_SAMPLE

This ticket is to investigate:

why is there missing data?
how do we fix the data relay server pipeline moving forward?
is it worth the effort to backfill previous missing data?

This code can be used to compare the number of observations between the clearinghouse and the data relay server, which is a good benchmark for now, but we'll eventually want to be able to detect missing data in the data relay server without relying on clearinghouse as a source of truth:


with sample_ids as
    (select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '3') sample (331 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '4') sample (360 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '5') sample (250 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '6') sample (272 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '7') sample (359 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '8') sample (340 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '10') sample (310 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '11') sample (323 rows) union
    select id, district from (select * from raw_prd.clearinghouse.station_meta where district = '12') sample (344 rows)),

    clearinghouse as 
    (select sample_date, count(distinct sample_timestamp, id) as c, 'clearinghouse' as server
    from raw_prd.clearinghouse.station_raw
    where sample_date > '2024-04-23' and ID in (select id from sample_ids)
    group by  sample_date),

    data_relay as
    (select sample_date, count(distinct sample_time, vds_id) as c, 'data_relay' as server from raw_prd.db96.vds30sec
    where sample_date > '2024-04-23' and vds_ID in (select id from sample_ids)
    group by sample_date)

select sample_date, c, server from clearinghouse UNION select sample_date, c, server from data_relay
order by sample_date

pingpingxiu-DOT-ca-gov commented 3 weeks ago

Thanks @summer-mothwood . Based on your findings, I propose:

On Kibana, compare the Data Relay counts with Clearhouse on a Hourly basis recently. Now we have ES Kibana, we can pull both DB96 and Snowflake tables to Kibana and comparing side by side. This would address both #454 and #423 you created.

If the latest compare reveals gaps, we can troubleshoot accordingly.
Past data holes need patching, to a meaningful extent.

Snowflake can retrieve the list of windows that need to re-crawl by Data Relay. (We need to fix idempotency issue on Data Relay first.)

@ZhenyuZhu-Caltrans @ian-r-rose

summer-mothwood commented 3 weeks ago

@pingpingxiu-DOT-ca-gov I agree, I think pulling both SB96 and Snowflake tables to Kibana to compare would be a great way to monitor for data issues for now. But for #454 we also want to create checks that would identify missing db96 data in snowflake without needing to rely on clearinghouse for comparison.

summer-mothwood commented 2 weeks ago

The conversation / work for this ticket has been happening in #423, so in interest of clarity, we're closing this ticket and re-opening #423

cagov / caldata-mdsa-caltrans-pems

Investigate missing data in data relay server #453