eroten closed this issue 4 years ago
For imputing 30-second interval volume and occupancy (after replacing impossible values with NA), the method developed by @ashleyasmus, a center-aligned rolling mean over the two observations adjacent to the given observation, consistently preserves the overall distribution of each variable. A few NA values remain where the rolling window has no non-missing neighbors, but total NA values are significantly reduced.
This is accomplished using `data.table::shift()` and `data.table::frollapply()`. See dee3ece19fbb39d2daa4809761dce44f23156e1e.
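A rough sketch of the idea (the toy data and exact window handling are illustrative, not the package's actual implementation):

```r
library(data.table)

# Toy 30-second volume series with impossible values already set to NA
dt <- data.table(volume = c(3, NA, 5, 4, NA, 6))

# Center-aligned neighbor mean: for row i, average rows i - 1 and i + 1
neighbor_mean <- rowMeans(
  cbind(shift(dt$volume, type = "lag"),
        shift(dt$volume, type = "lead")),
  na.rm = TRUE
)

# Fill only the missing observations. Positions where both neighbors are
# also missing stay missing, which is why a few NAs remain afterward.
dt[, volume := fifelse(is.na(volume), neighbor_mean, volume)]
```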
For imputing speed at 60-minute intervals, I'm going with the random forest method in {mice} (for now). The formula used is `speed ~ volume.sum + occupancy.sum + interval_bin + day_type`. The plots below show the imputed speed density and speed points by hour, day type, and imputation number. I ran a model with 25 imputations to get a better feel for it, but the vignette only runs 5 imputations.
EDIT: the distribution of NA values across hour and day type is also relevant.
See commit e9ffdccd5a3a4304b28a4c8db4fa83f9cb007bb3
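A minimal sketch of that setup with {mice} (the toy data frame and its values are invented; only the column names follow the formula above):

```r
library(mice)

# Invented 60-minute aggregates; columns mirror the imputation formula
speed_dat <- data.frame(
  speed = c(62, NA, 58, NA, 65, 60, 59, NA, 63, 61, 57, 64),
  volume.sum = c(1200, 1400, 1350, 900, 1100, 1250,
                 1300, 950, 1150, 1225, 1050, 1275),
  occupancy.sum = c(300, 340, 330, 250, 280, 310,
                    320, 260, 290, 305, 270, 315),
  interval_bin = factor(rep(c(7, 8, 9), 4)),
  day_type = factor(rep(c("weekday", "weekend"), each = 6))
)

# Random forest imputation; the vignette uses m = 5 imputations
imp <- mice(speed_dat, method = "rf", m = 5, seed = 1, printFlag = FALSE)

# Pull out one of the m completed data sets
speed_complete <- complete(imp, action = 1)
```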
Volume and occupancy imputation by rolling average is controlled by a parameter in `aggregate_sensor()`.
An example of speed imputation using {mice} is found in the "Calculate speed and delay" vignette.
Depending on the context and a given analysis's vulnerability to NA values, it may be appropriate to replace NA values for a given sensor or station with the posted speed limit (`r_node_s_limit` in the sensor configuration table). This is explored further in the "Key congestion metrics" vignette (currently in development). Take care when making this substitution and be clear about your reasoning.
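A hedged sketch of that substitution (the table layout and every column name except `r_node_s_limit` are illustrative):

```r
library(data.table)

# Illustrative speed measurements with gaps
speed_dt <- data.table(
  sensor = c("100", "100", "101"),
  speed  = c(NA, 55, NA)
)

# Illustrative slice of the sensor configuration table
config_dt <- data.table(
  detector_name  = c("100", "101"),
  r_node_s_limit = c(55, 60)
)

# Join the posted speed limit onto the measurements and fill only NA
# speeds. Document this substitution clearly in any downstream analysis.
speed_dt[config_dt, on = .(sensor = detector_name),
         speed := fifelse(is.na(speed), as.numeric(i.r_node_s_limit), speed)]
```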
My inclination is to use {mice} so we can take advantage of multivariate imputation.