Metropolitan-Council / tc.sensors

Package with functions to pull sensor data, sensor IDs, and sensor configuration for MnDOT metro district
https://metropolitan-council.github.io/tc.sensors

Impute missing volume, occupancy, and speed values #10

Closed: eroten closed this issue 4 years ago

eroten commented 4 years ago

My inclination is to use {mice} so we can use multivariate imputation.

The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.

eroten commented 4 years ago

For imputing the 30-second interval volume and occupancy (after replacing impossible values with NA), the method developed by @ashleyasmus uses a center-aligned rolling mean of the two observations next to the given observation. It consistently preserves the overall distribution of each variable. A few NA values remain due to the rolling average method, but the total number of NA values drops significantly.

This is accomplished using data.table::shift() and data.table::frollapply()

See commit dee3ece19fbb39d2daa4809761dce44f23156e1e
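For readers who want to see the idea in isolation, here is a minimal sketch of neighbor-mean imputation with data.table::shift(). It is not the code from the commit above: the toy series, the column names (volume, volume_prev, volume_next, volume_imputed), and the use of fifelse() in place of frollapply() are all assumptions made for illustration.

```r
library(data.table)

# Hypothetical 30-second volume series; NA marks missing or impossible values
dt <- data.table(volume = c(4, 5, NA, 7, 8, NA, NA, 3))

# Observations immediately before and after each row
dt[, volume_prev := shift(volume, n = 1, type = "lag")]
dt[, volume_next := shift(volume, n = 1, type = "lead")]

# Fill a gap with the mean of its two neighbors; observed values are kept as-is
dt[, volume_imputed := fifelse(
  is.na(volume),
  (volume_prev + volume_next) / 2,
  volume
)]

dt$volume_imputed
```

In this toy series the isolated gap in row 3 is filled with 6, while the two consecutive gaps in rows 6 and 7 stay NA because each has a missing neighbor. That is one way a rolling approach can leave some NA values behind.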

eroten commented 4 years ago

For imputing speed at 60-minute intervals, I'm going with the random forest method in {mice} (for now). The formula used is speed ~ volume.sum + occupancy.sum + interval_bin + day_type. The plots below show the imputed speed density and the imputed speed points by hour, day type, and imputation number. I ran a model with 25 imputations to get a better feel for it, but the vignette only runs 5 imputations.

[Image: imputed_speed_density]

[Image: imputed_speed_obs]

EDIT: also relevant is the distribution of NA values across hour and day type

[Image: na_speed_percentages]

See commit e9ffdccd5a3a4304b28a4c8db4fa83f9cb007bb3
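As a rough illustration of the approach described above (not the vignette code), the sketch below runs {mice} with method = "rf" on simulated hourly data. The data are made up, and the column names are taken from the formula in the comment. Because speed is the only column with missing values, the other four columns act as its predictors by default, which matches the formula. The random forest method needs a backend package such as ranger or randomForest to be installed.

```r
library(mice)

set.seed(10)
n <- 200

# Simulated hourly data (hypothetical values); names follow the formula above
hourly <- data.frame(
  volume.sum = rpois(n, lambda = 900),
  occupancy.sum = runif(n, min = 50, max = 600),
  interval_bin = sample(0:23, n, replace = TRUE),
  day_type = factor(sample(c("Weekday", "Weekend"), n, replace = TRUE))
)
hourly$speed <- 65 - 0.03 * hourly$occupancy.sum + rnorm(n, sd = 3)
hourly$speed[sample(n, 40)] <- NA # remove some speeds to impute

# m = 5 imputations with the random forest method; only speed is incomplete,
# so the remaining columns are used as its predictors
imp <- mice(hourly, m = 5, method = "rf", printFlag = FALSE, seed = 10)

# Extract the first of the five completed data sets
completed <- complete(imp, 1)
sum(is.na(completed$speed))
```

Density and strip plots of observed versus imputed values, like the ones referenced above, are a reasonable way to check whether the imputed speeds look plausible before using them.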

eroten commented 4 years ago

Volume and occupancy imputation by rolling average is available as a parameter in aggregate_sensor(). An example of speed imputation using {mice} is found in the "Calculate speed and delay" vignette.

Depending on the context and a given analysis's vulnerability to NA values, it may be appropriate to replace NA values for a given sensor or station with the posted speed limit (r_node_s_limit in the sensor configuration table). This is explored further in the "Key congestion metrics" vignette (currently in dev). Take care when making this substitution and be clear on your reasoning.
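A minimal sketch of that substitution, assuming a speed table and a configuration table that share a sensor identifier. Only r_node_s_limit comes from the sensor configuration table; the sensor column name, the values, and the speed_is_limit flag are hypothetical and used only for illustration.

```r
library(data.table)

# Hypothetical aggregated speeds with gaps
speeds <- data.table(
  sensor = c("100", "100", "101", "101"),
  speed = c(52.1, NA, NA, 58.3)
)

# Hypothetical slice of the sensor configuration table
config <- data.table(
  sensor = c("100", "101"),
  r_node_s_limit = c(55, 60)
)

# Attach the posted speed limit to each observation
merged <- merge(speeds, config, by = "sensor")

# Record which rows are substituted so the choice stays visible downstream
merged[, speed_is_limit := is.na(speed)]
merged[, speed_filled := fifelse(is.na(speed), r_node_s_limit, speed)]

merged
```

Keeping a flag column such as speed_is_limit makes it easy to report how many values were substituted, or to drop them again if the analysis turns out to be sensitive to this choice.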