cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License
5 stars 0 forks source link

Revamp active_stations and detector_status models once VDS metadata lands #226

Closed ian-r-rose closed 1 week ago

ian-r-rose commented 2 months ago

The int_clearinghouse__active_stations and int_diagnostics__detector_status models are key models for determining which stations are deployed and considered reliable on any given day. Unfortunately, we have a few concerns about the data quality around for detectors (cf. some discussion in #217 ).

Once we have the VDS metadata from #115, we should revamp those models with the following constraints:

  1. No gaps in dates, no null values
  2. No gaps in stations, no null values
  3. No gaps in detectors/lanes, no null values
  4. No rows detectors/lanes that don't exist (i.e., the eighth lane in a highway that only has two lanes)

Once this is done, the int_clearinghouse__real_detectors model in #209 can be removed.

tnrahim commented 1 month ago

Starting with int_diagnostics__detector_status

create a auxillary table (CTE) that has detector(lane)/date detector is station lane combination

tnrahim commented 1 month ago

Finished reviewing issue and relevant datasets. Next Steps Use Date_spine (DBT UTIL) to create sample date column with all dates Use int_vdsdetector_config to generate Station_id/Detector_ID/District/Lane data Combine with int_diagnosticsdetector_status to restore missing records

tnrahim commented 1 month ago
tnrahim commented 1 month ago

I noticed there are many instance of TRANSFORM_PRD.VDS.INT_VDS__DETECTOR_CONFIG with a lane value of null. I was wondering if that is expected output? or if it something that is supposed to be resolved as part of this issue?

ian-r-rose commented 1 month ago

Interesting. What do those rows have for the status column? Could this be related to #273 ?

tnrahim commented 1 month ago

The status for the all of the lanes with value null is '0'. I'll review 273 and see if I can find out more. Thanks Ian!

ian-r-rose commented 1 month ago

It may be as simple as filtering out those detectors in the intermediate model. If so, we could fix #273 at the same time as this.

tnrahim commented 1 month ago

I reviewed TRANSFORM_PRD.VDS.INT_VDS__DETECTOR_CONFIG and discovered every single instance of status = 0 has a null lane value. Based on that I have filtered out status =0 when adding metadata from TRANSFORM_PRD.VDS.INT_VDSDETECTOR_CONFIG to int_diagnosticsdetector_status which resolved the issue I had will null lane values.

tnrahim commented 1 month ago

Discovered that int_diagnosticsdet_diag_set_assignment has stations that are not present in INT_VDSDETECTOR_CONFIG. When we looked yesterday there was about 300 missing stations. When I checked today there are 6.2K missing stations. Any idea what could have caused that change?

Query used

select distinct station_id from TRANSFORM_PRD.diagnostics.int_diagnosticsdet_diag_set_assignment minus select distinct station_id from TRANSFORM_PRD.VDS.INT_VDSDETECTOR_CONFIG;

Result set is attached. missing_stations.csv

jkarpen commented 1 month ago

Per @ian-r-rose the additional missing stations are likely due to ongoing issues with the metadata being brought over from the data relay server.

jkarpen commented 1 month ago

Reopening this issue as some additional follow-on work is needed.

tnrahim commented 1 month ago

Updates: updated detector_status to add missing sample data and remove nulls using pull request #318. Currently resolving CI errors prior to review and merging.