cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License

Deduplicate data relay VDS data #274

Closed ian-r-rose closed 4 months ago

ian-r-rose commented 5 months ago

The 30-second data relay feed may sometimes contain duplicate samples, especially when we are recovering from incidents or backfilling.

We should add logic to the staging model to handle this possibly duplicated data.

pingpingxiu-DOT-ca-gov commented 5 months ago

I'm thinking of using snowsql to do that. Approach 1: a SQL query that deletes the duplicated samples in place. Approach 2: create a new table that holds the deduplicated data, instead of removing rows in place as in Approach 1.

What are your thoughts? @ian-r-rose
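To make the two approaches concrete, here is a minimal sketch that runs against an in-memory SQLite database as a stand-in for Snowflake. Everything in it is an assumption for illustration: the table name `vds30sec`, the columns `station_id`, `lane`, `sample_ts`, `volume`, and `loaded_at`, and the rule "keep the most recently loaded copy" are hypothetical and not taken from the actual data relay schema. In Snowflake, the same `ROW_NUMBER()` filter could likely be written more compactly with a `QUALIFY` clause.

```python
import sqlite3

# Hypothetical key: one sample per station, lane, and 30-second timestamp.
WINDOW = (
    "ROW_NUMBER() OVER ("
    "PARTITION BY station_id, lane, sample_ts ORDER BY loaded_at DESC)"
)


def make_sample_db():
    """Build an in-memory table with one backfilled duplicate (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vds30sec ("
        "station_id INTEGER, lane INTEGER, sample_ts TEXT, "
        "volume INTEGER, loaded_at TEXT)"
    )
    rows = [
        (1001, 1, "2024-01-01 00:00:00", 3, "2024-01-01 00:01:00"),
        (1001, 1, "2024-01-01 00:00:00", 3, "2024-01-02 09:00:00"),  # duplicate
        (1001, 1, "2024-01-01 00:00:30", 5, "2024-01-01 00:01:30"),
        (1002, 1, "2024-01-01 00:00:00", 2, "2024-01-01 00:01:00"),
    ]
    conn.executemany("INSERT INTO vds30sec VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def dedupe_in_place(conn):
    """Approach 1: delete every row except the newest copy of each sample."""
    conn.execute(
        f"""
        DELETE FROM vds30sec
        WHERE rowid NOT IN (
            SELECT rid FROM (
                SELECT rowid AS rid, {WINDOW} AS rn FROM vds30sec
            ) WHERE rn = 1
        )
        """
    )


def dedupe_to_new_table(conn):
    """Approach 2: leave the source untouched and write a deduplicated copy."""
    conn.execute(
        f"""
        CREATE TABLE vds30sec_deduped AS
        SELECT station_id, lane, sample_ts, volume, loaded_at
        FROM (SELECT *, {WINDOW} AS rn FROM vds30sec)
        WHERE rn = 1
        """
    )


def count_rows(conn, table):
    """Return the number of rows in a table (table name is trusted here)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

One trade-off between the two: Approach 2 keeps the raw rows available for auditing, while Approach 1 avoids storing a second copy but cannot be undone without reloading the source.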

pingpingxiu-DOT-ca-gov commented 4 months ago

https://github.com/cagov/caldata-mdsa-caltrans-pems/blob/ed2605f906e30bb5c9c788bdd90dc46e6c09f019/transform/models/intermediate/diagnostics/int_diagnostics__real_detector_status.sql#L7-L16

ian-r-rose commented 4 months ago

Closing as completed by #304