cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License
7 stars 1 forks source link

Investigate `null` lane numbers in `int_diagnostics__detector_status` #217

Closed ian-r-rose closed 5 months ago

ian-r-rose commented 6 months ago

This table has a significant fraction of null lane numbers, but it shouldn't!

image
kengodleskidot commented 6 months ago

@ian-r-rose My understanding is that the null values are the result of bringing in stations that are active on a given date that do not have any raw data associated with them. The lane values come from the int_diagnostics__samples_per_station model which get them from the stg_clearinghousestation_raw model. Once we bring in the detector table, we can associate all detectors for station with their associated lane and eliminate these nulls. We would have used the stg_clearinghousestation_status model to bring in the detectors by lane but noticed that we were not getting all detectors for a station on any given date.

ian-r-rose commented 6 months ago

Hmm, so it it correct to say that all of these stations with null values for the lane would have a 'Down/No Data' status?

My concern with having so many nulls here is that I'd like to be able to use int_diagnostics__detector_status to filter data for all detectors that are "Good" for a given day. I suppose if all of the nulls represent detectors that are not good, we can work with it. But I think it would be nice to do some of the legwork to ensure that int_diagnostics__detector_status has a grain of detector to more easily be able to use it. As it is, this table has data integrity problems, with about half the rows not having a valid lane number.

junlee-analytica commented 6 months ago

This will require work on the Active Stations model.

kengodleskidot commented 6 months ago

Hmm, so it it correct to say that all of these stations with null values for the lane would have a 'Down/No Data' status?

My concern with having so many nulls here is that I'd like to be able to use int_diagnostics__detector_status to filter data for all detectors that are "Good" for a given day. I suppose if all of the nulls represent detectors that are not good, we can work with it. But I think it would be nice to do some of the legwork to ensure that int_diagnostics__detector_status has a grain of detector to more easily be able to use it. As it is, this table has data integrity problems, with about half the rows not having a valid lane number.

Yes, the stations and associated lanes would have a status of Down/No Data. I will add a where clause to reduce the number of lanes reporting status values where lanes do not exist (lane <= total lanes for a station). I will also look at incorporating the detector/lane data from the int_clearinghouse__station_status model to further reduce the number of null values. Once we have the configuration tables, we will also be able to incorporate controller data for additional statuses.

ian-r-rose commented 6 months ago

We should have controller data quite soon, so it may be worth waiting until then before implementing too many workarounds!

kengodleskidot commented 5 months ago

Based on review of the current detector_status model and the upcoming incorporation of the configuration data, the null values should be eliminated once the detector configuration data is joined with the detector_status.