EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

Value interpolation instead of rounding #62

Closed danielsjf closed 5 years ago

danielsjf commented 5 years ago

The rounding in thicken works very well for count-like (volumetric) data. For instance, flight departure times can be rounded down to the hour, so that the flights departing within each hour can be counted.

However, for continuous measurements like temperature, a timewise allocation to the closest hour seems more appropriate.

Say we have the following data.

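library(dplyr)
library(padr)  # provides thicken()

# Irregularly spaced temperature readings
temp <- tibble(date = as.POSIXct(c('2018-11-20 9:02:01', '2018-11-20 9:32:33', 
                                   '2018-11-20 9:45:20', '2018-11-20 10:14:40', 
                                   '2018-11-20 10:51:22'), 
                                 tz = 'UTC'),
               temp = c(18,17,16,15,14))
# thicken() adds a column with each timestamp rounded down to the hour
temp %>% thicken('hour')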

The best estimate for the temperature at exactly each hour is the closest reading for the first and the last hour, and the linear interpolation between the two surrounding readings for the hours in between. The expected result would be:

temp <- tibble(date = as.POSIXct(c('2018-11-20 9:00:00', '2018-11-20 10:00:00', 
                                   '2018-11-20 11:00:00'), 
                                 tz = 'UTC'),
               temp = c(18,15.5,14))

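In this case, the 15.5 was calculated as the time-weighted average of the 16 (at 9:45:20) and the 15 (at 10:14:40). Both readings are 14 minutes 40 seconds away from 10:00, so they get equal weight.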
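
For reference, the calculation described above can be reproduced with base R's approx(). This is only a sketch of the requested behaviour, not something padr does: the variable names (obs_time, obs_temp, hours) are made up for illustration, and rule = 2 is used on the assumption that "closest value" is what is wanted for hours outside the observed range.

```r
# Readings from the example above
obs_time <- as.POSIXct(c('2018-11-20 09:02:01', '2018-11-20 09:32:33',
                         '2018-11-20 09:45:20', '2018-11-20 10:14:40',
                         '2018-11-20 10:51:22'),
                       tz = 'UTC')
obs_temp <- c(18, 17, 16, 15, 14)

# Exact hours at which an estimate is wanted
hours <- seq(as.POSIXct('2018-11-20 09:00:00', tz = 'UTC'),
             as.POSIXct('2018-11-20 11:00:00', tz = 'UTC'),
             by = 'hour')

# Linear interpolation between neighbouring readings; rule = 2 falls back to
# the nearest reading for hours outside the observed range
est <- approx(x = as.numeric(obs_time), y = obs_temp,
              xout = as.numeric(hours), method = 'linear', rule = 2)

result <- data.frame(date = hours, temp = est$y)
result
```

The temp column of result should match the expected values 18, 15.5 and 14.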

I don't know if this completely fits under thicken, but I do think it would open a lot of other use cases for the package.

EdwinTh commented 5 years ago

Thanks for your input. Everything in padr is built around the concept of the interval, which this is not, I am afraid, so I don't think this is a case to be implemented in padr. That said, I think you can use thicken to build a wrapper that does what you are after.
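
One way such a wrapper might look is sketched below. This is an assumption about what was meant, not code from padr: interpolate_to_hours is a hypothetical helper, thicken() is used only to find which hours the readings cover, and the interpolation itself is done with base R's approx().

```r
library(padr)

# Hypothetical helper (not part of padr): estimate value_col at the start of
# every hour spanned by the readings in df
interpolate_to_hours <- function(df, time_col, value_col) {
  # thicken() floors each timestamp to its hour; used here only to find
  # the range of hours covered by the data
  floored <- thicken(df, interval = 'hour', colname = 'hour_floor',
                     by = time_col)

  # Hourly grid from the first floored hour to one hour past the last
  grid <- seq(min(floored$hour_floor), max(floored$hour_floor) + 3600,
              by = 'hour')

  # Linear interpolation; rule = 2 uses the nearest reading at the edges
  est <- approx(x = as.numeric(df[[time_col]]), y = df[[value_col]],
                xout = as.numeric(grid), method = 'linear', rule = 2)

  out <- data.frame(grid, est$y)
  names(out) <- c(time_col, value_col)
  out
}

readings <- data.frame(
  date = as.POSIXct(c('2018-11-20 09:02:01', '2018-11-20 09:32:33',
                      '2018-11-20 09:45:20', '2018-11-20 10:14:40',
                      '2018-11-20 10:51:22'),
                    tz = 'UTC'),
  temp = c(18, 17, 16, 15, 14))

hourly <- interpolate_to_hours(readings, 'date', 'temp')
hourly
```

Note that the grid always extends one hour past the last reading, so a final reading that falls exactly on the hour would produce one extra row filled with that reading's value.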