EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

Issue with thicken on variables with multiple missing values #66

Closed ghost closed 5 years ago

ghost commented 5 years ago

There seems to be an issue with thicken on variables with at least two missing values. Here is a little reprex to show the problem:

## create dataframe
data <- data.frame(date = seq(as.Date("2000-01-01"), as.Date("2000-04-30"), "weeks"))

## indices of missing observations
na_index <- c(2, 5, 9, 14)

## convert observations to missing
data$date[na_index] <- NA

## the data
data
#>          date
#> 1  2000-01-01
#> 2        <NA>
#> 3  2000-01-15
#> 4  2000-01-22
#> 5        <NA>
#> 6  2000-02-05
#> 7  2000-02-12
#> 8  2000-02-19
#> 9        <NA>
#> 10 2000-03-04
#> 11 2000-03-11
#> 12 2000-03-18
#> 13 2000-03-25
#> 14       <NA>
#> 15 2000-04-08
#> 16 2000-04-15
#> 17 2000-04-22
#> 18 2000-04-29

Created on 2019-04-14 by the reprex package (v0.2.1)

Now, if we call thicken on this data, the NA values in the new column are misplaced: for the second, third, and fourth missing value of the original column, the NA lands one, two, and three rows too late, respectively:

## run thicken
padr::thicken(x = data, by = "date", interval = "month")
#> Warning: There are NA values in the column date.
#> Returned dataframe contains original observations, with NA values for date and date_month.
#>          date date_month
#> 1  2000-01-01 2000-01-01
#> 2        <NA>       <NA>
#> 3  2000-01-15 2000-01-01
#> 4  2000-01-22 2000-01-01
#> 5        <NA> 2000-02-01
#> 6  2000-02-05       <NA>
#> 7  2000-02-12 2000-02-01
#> 8  2000-02-19 2000-02-01
#> 9        <NA> 2000-03-01
#> 10 2000-03-04 2000-03-01
#> 11 2000-03-11       <NA>
#> 12 2000-03-18 2000-03-01
#> 13 2000-03-25 2000-03-01
#> 14       <NA> 2000-04-01
#> 15 2000-04-08 2000-04-01
#> 16 2000-04-15 2000-04-01
#> 17 2000-04-22       <NA>
#> 18 2000-04-29 2000-04-01


I think the issue comes from the following line of the add_na_to_thicken function: https://github.com/EdwinTh/padr/blob/30654549062815aa264e58e2daa16541d9246c5e/R/thicken.R#L238

For example, take the second missing value. The non-missing values after it are shifted by at least two positions (in the example above, 2000-02-05 has index 4 in the thickened vector, while the second missing value has index 5), so we have to subtract 1.5 from the index of the second missing value, not just 0.5. Similarly, for the third, fourth, etc. missing value, we have to subtract 2.5, 3.5, etc. from the original index of the missing value.

This suggests that the following quick and dirty modification might work (and indeed, it did work for me); however, it might not be the best solution.

## seq_along() instead of seq(): with a single NA, seq(na_ind) would expand to 1:na_ind
return_ind <- c(seq_along(thickened), na_ind - (0.5 + (seq_along(na_ind) - 1)))
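
The index arithmetic can also be checked without padr. The following is only a minimal sketch, not the package's actual add_na_to_thicken code: `reinsert_na` and its `corrected` argument are hypothetical names, and it assumes that the thickened vector holds only the non-missing rows in their original order and that `na_ind` is sorted in increasing order.

```r
## Minimal sketch of the reordering idea, independent of padr.
## `reinsert_na` is a hypothetical helper, not part of the package.
reinsert_na <- function(thickened, na_ind, corrected = TRUE) {
  ## Offset subtracted from each NA position to obtain its sort key:
  ## 0.5, 1.5, 2.5, ... when corrected, a constant 0.5 otherwise.
  offset <- if (corrected) seq_along(na_ind) - 0.5 else 0.5
  ## Sort keys: 1..n for the thickened values, fractional keys for the NAs.
  return_ind <- c(seq_along(thickened), na_ind - offset)
  ## Append one NA per missing row, then put everything in key order.
  c(thickened, rep(NA, length(na_ind)))[order(return_ind)]
}

date <- seq(as.Date("2000-01-01"), as.Date("2000-04-30"), "weeks")
na_ind <- c(2, 5, 9, 14)
date[na_ind] <- NA

## Stand-in for the thickened values: month start of each non-missing date.
thickened <- as.Date(format(date[!is.na(date)], "%Y-%m-01"))

## Constant offset of 0.5: the NAs drift, as in the output above.
which(is.na(reinsert_na(thickened, na_ind, corrected = FALSE)))
#> [1]  2  6 11 17

## Growing offset: the NAs return to their original rows.
which(is.na(reinsert_na(thickened, na_ind, corrected = TRUE)))
#> [1]  2  5  9 14
```

With the constant offset, every NA inserted earlier pushes the later sort keys one position further from where they should be, which is why the drift grows by one per missing value.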
EdwinTh commented 5 years ago

Thank you for your effort in not only reporting the bug, but also looking for its cause. Much appreciated! I can confirm reproduction of the bug and will provide a fix in the upcoming release.

ghost commented 5 years ago

Thank you for the great package, makes my life a lot easier! I am glad I could help!