carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

Fix bug in process_location_types.R script #223

Closed jenniferfedor closed 1 year ago

jenniferfedor commented 1 year ago

We resample each row of location data forward in time into successive minute bins until we reach either 1 ms before the next sensed location timestamp or the last sensed timestamp plus the consecutive threshold buffer, whichever comes first. As a result, the time difference between a final resampled row and the subsequent sensed location row can be <1 minute (60000 ms), and as small as a few ms:

| resample_group | limit | timestamp | provider | id | diff_bw_curr_and_next_row_ms |
|---|---|---|---|---|---|
| 830 | 1660842237896 | 1660842117894 | fused | 0 | 60000 |
| 830 | 1660842237896 | 1660842177894 | resampled | 1 | 60000 |
| 830 | 1660842237896 | 1660842237894 | resampled | 2 | 3 |
| 831 | 1660842357892 | 1660842237897 | fused | 0 | 60000 |
| 831 | 1660842357892 | 1660842297897 | resampled | 1 | 59996 |
| 832 | 1660842477893 | 1660842357893 | fused | 0 | 60000 |
| 832 | 1660842477893 | 1660842417893 | resampled | 1 | 60001 |
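
The short gap in the table above can be reproduced with a minimal sketch of the resampling rule. This is written in Python for illustration only and is not the code in process_location_types.R. The function name `resample_forward`, its parameters, and the choice to stop once a candidate timestamp is no longer strictly before the limit are all assumptions made for this example.

```python
MINUTE_MS = 60_000  # width of one minute bin in milliseconds


def resample_forward(sensed_ts, next_sensed_ts, threshold_ms, provider="fused"):
    """Hypothetical helper: repeat one sensed row forward in 1-minute steps.

    The limit is whichever comes first: 1 ms before the next sensed
    timestamp, or the sensed timestamp plus the consecutive threshold.
    Returns (limit, rows), where rows are (timestamp, provider, id) tuples.
    """
    limit = min(next_sensed_ts - 1, sensed_ts + threshold_ms)

    # The sensed row itself is always kept, with id 0.
    rows = [(sensed_ts, provider, 0)]

    # Assumption: a resampled row is emitted only while it is strictly
    # before the limit.
    row_id = 1
    ts = sensed_ts + MINUTE_MS
    while ts < limit:
        rows.append((ts, "resampled", row_id))
        row_id += 1
        ts += MINUTE_MS
    return limit, rows


if __name__ == "__main__":
    # Resample group 830 from the table; the 30-minute threshold is an
    # assumed value, not one taken from the pipeline configuration.
    limit, rows = resample_forward(1660842117894, 1660842237897, 30 * MINUTE_MS)
    for row in rows:
        print(row)
    print("gap to next sensed row (ms):", 1660842237897 - rows[-1][0])
```

Because the limit is tied to the next sensed timestamp rather than to a whole number of minutes, the last resampled row can land arbitrarily close to the next sensed row.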


Including such rows in the processed locations data can produce unexpected negative values for features like varspeed (which should always be non-negative) in downstream processed data from the PHONE_LOCATIONS DORYAB provider.

We therefore add a condition to drop rows from the processed locations data when the provider is resampled and the difference between that resampled row's timestamp and the next (leading) timestamp is <60000 ms:

| resample_group | limit | timestamp | provider | id | diff_bw_curr_and_next_row_ms |
|---|---|---|---|---|---|
| 830 | 1660842237896 | 1660842117894 | fused | 0 | 60000 |
| 830 | 1660842237896 | 1660842177894 | resampled | 1 | 60003 |
| 831 | 1660842357892 | 1660842237897 | fused | 0 | 119996 |
| 832 | 1660842477893 | 1660842357893 | fused | 0 | 60000 |
| 832 | 1660842477893 | 1660842417893 | resampled | 1 | 60001 |
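
The drop condition described above can be sketched as follows. This is an illustrative Python version, not the R code from the script. The function name `drop_short_resampled_rows`, the dict-based row layout, and the decision to keep a final row that has no following row are assumptions made for this example.

```python
MIN_GAP_MS = 60_000  # minimum allowed gap after a resampled row, in ms


def drop_short_resampled_rows(rows, min_gap_ms=MIN_GAP_MS):
    """Hypothetical helper: drop resampled rows followed too closely.

    `rows` is a list of dicts with "timestamp" and "provider" keys, sorted
    by timestamp. A row is dropped when its provider is "resampled" and the
    next (leading) row in the original data is less than `min_gap_ms` later.
    """
    kept = []
    for i, current in enumerate(rows):
        if i + 1 < len(rows):
            gap = rows[i + 1]["timestamp"] - current["timestamp"]
        else:
            gap = None  # last row: no leading timestamp (assumption: keep it)
        if current["provider"] == "resampled" and gap is not None and gap < min_gap_ms:
            continue
        kept.append(current)
    return kept


if __name__ == "__main__":
    # Timestamps and providers from the table before the fix.
    before = [
        {"timestamp": 1660842117894, "provider": "fused"},
        {"timestamp": 1660842177894, "provider": "resampled"},
        {"timestamp": 1660842237894, "provider": "resampled"},
        {"timestamp": 1660842237897, "provider": "fused"},
        {"timestamp": 1660842297897, "provider": "resampled"},
        {"timestamp": 1660842357893, "provider": "fused"},
        {"timestamp": 1660842417893, "provider": "resampled"},
    ]
    for row in drop_short_resampled_rows(before):
        print(row["timestamp"], row["provider"])
```

Gaps are computed once against the original row order, so removing one resampled row does not cause the row before it to be re-evaluated. Sensed rows are never dropped, whatever the gap that follows them.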


Note that this change still allows for the time difference between two sensed location timestamps to be <60000 ms. It only ensures that the time difference between a resampled timestamp and subsequent sensed location timestamp will be $\ge$ 60000 ms. The time difference between a sensed location timestamp and subsequent resampled timestamp or between two consecutive resampled timestamps is always exactly 60000 ms.