EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

group argument not working for data frames with multiple classes (tibbles in particular) #74

Closed ghost closed 4 years ago

ghost commented 4 years ago
df
#> # A tibble: 5,856 x 5
#>    Scar_Id      Code     Type         Value      YrMo      
#>    <chr>        <chr>    <chr>        <date>     <date>    
#>  1 0070-179     NA       Start_Date   2020-04-22 2020-04-01
#>  2 0070-179     NA       Closure_Date 2020-05-23 2020-05-01
#>  3 1139-179     NA       Start_Date   2020-04-23 2020-04-01
#>  4 1139-179     NA       Closure_Date 2020-05-23 2020-05-01
#>  5 262-179      NA       Start_Date   2019-08-29 2019-08-01
#>  6 262-179      NA       Closure_Date 2020-05-23 2020-05-01
#>  7 270-179      BB       Start_Date   2019-08-29 2019-08-01
#>  8 270-179      BB       Closure_Date 2020-05-23 2020-05-01
#>  9 476-179      XX       Start_Date   2019-09-04 2019-09-01
#> 10 476-179      XX       Closure_Date 2019-11-04 2019-11-01
#> # ... with 5,846 more rows

My df data frame above has 5,856 rows and is composed of 2,923 groups. I can run the following pipe on df to create new rows that pad YrMo by month.

df %>% group_by(Scar_Id) %>% do(pad(., by = "YrMo", interval = "month"))

However, if I try to do the same thing using padr's group argument two strange things happen:

df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
...
#> Warning message: datetime variable does not vary for 537 of the groups,
#> no padding applied on this / these group(s)
  1. All Scar_Id groups with Code "NA" get padded correctly (e.g., rows 5 and 6 above).
  2. All groups whose Code is a character string, in other words groups that are not "NA", do not get padded at all, and I get the warning message shown above (e.g., rows 7 through 10 above).

I think this is somehow related to the size of my data frame and the number of groups it contains. If I slice the data frame into smaller parts the group argument starts behaving as I expect it to, "NA" Codes get padded, and character string Codes get padded:

# This works
df %>% slice(1:100) %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
# and this works too
df %>% slice(200:500) %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")

I would include the complete data frame, but it is large and contains sensitive information. I should also mention that when I run dput() on my data frame the class is as follows, which seems odd because there are multiple classes. I'm used to seeing one. Maybe this is a contributing factor?

df %>% dput()
#> ...
#> class = c("tbl_df", "tbl", "data.frame")
#> ...

Update: Yup, if I change the class of df to a single "data.frame" via "class<-"("data.frame"), the group argument of the pad function works as expected.
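
For reference, a minimal sketch of inspecting and dropping the extra classes. It uses a small made-up tibble (`toy`, hypothetical, not the data from this issue) and assumes the tibble package is installed. `as.data.frame()` is used here as a more conventional alternative to assigning the class directly.

```r
library(tibble)

# Hypothetical toy data standing in for the real df.
toy <- tibble(
  Scar_Id = c("a", "a", "b", "b"),
  YrMo    = as.Date(c("2020-01-01", "2020-03-01", "2020-02-01", "2020-04-01"))
)

# A tibble carries three classes: "tbl_df", "tbl", "data.frame".
class(toy)

# Converting to a plain data frame leaves only "data.frame".
plain <- as.data.frame(toy)
class(plain)
```

The columns and values are unchanged by the conversion; only the class attribute (and with it the printing and subsetting behaviour) differs.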

EdwinTh commented 4 years ago

Hi Jason, thanks for your question. Unfortunately, I cannot reproduce this from the description alone. It would be very helpful if you could draw up a reprex so we can investigate it further.

From your example, at least I can tell that:

  1. This warning is expected if a group contains only one record (there is no start and end then).
  2. The value of Code should not matter, because it is not used anywhere in the grouping. It cannot affect the padding.
  3. There is no maximum on the number of groups, so I am pretty sure this is not the culprit.
  4. The multiple classes are expected for a tibble (it inherits from a regular data.frame and adds classes to it).
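
One way to check point 1 against the data is to count the distinct dates per group and list the groups where the date does not vary, which is what the warning text describes. This is only a sketch on hypothetical toy data (`toy` and its values are made up); it assumes dplyr is installed and uses column names mirroring the ones in the issue.

```r
library(dplyr)

# Hypothetical toy data: group "b" has one record,
# group "c" has two records with the same date.
toy <- tibble::tibble(
  Scar_Id = c("a", "a", "b", "c", "c"),
  YrMo    = as.Date(c("2020-01-01", "2020-03-01", "2020-02-01",
                      "2020-05-01", "2020-05-01"))
)

# Groups with exactly one distinct date have nothing to pad between.
non_varying <- toy %>%
  group_by(Scar_Id) %>%
  summarise(n_rows = n(), n_dates = n_distinct(YrMo)) %>%
  filter(n_dates == 1)

non_varying
```

Running the same summary on the real data and comparing the number of rows in the result with the 537 groups reported in the warning would show whether the warning simply reflects the data.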

ghost commented 4 years ago

Hi Edwin. I tried to create a repro to demonstrate this behavior (see below) but it works fine! Perhaps we close the issue, and I'll post more information if it becomes available. Thank you for the consideration.

library(tidyverse)
library(lubridate)
library(padr)

df.tib <- 
  tibble(col1 = rep(paste0(as.character(1:3000), "-", "A"), 2)) %>% 
  arrange(col1) %>% 
  mutate(col2 = c(rep(NA, 2000), rep(c("YY", "YY", "ZZ", "ZZ"), 1000)),
         col3 = rep(c("Start", "End"), 3000),
         col4 = as.Date(1:6000, origin = "2017-01-01"),
         col5 = as.Date(rep(c("2019-01-01", "2019-03-01"), 3000))) %>% 
  pad(group = "col1", by = "col5", interval = "month")

df.tib
df.tib %>% tail(20)

EdwinTh commented 4 years ago

Ok, I suspect it is a peculiarity of your data set. Please feel free to reopen if you still think some things don't work as they should.

EdwinTh commented 4 years ago

One more thing: are you on the latest version of padr (0.5.1)? Due to an update of the tibble package I had to do a patch release a few weeks ago, and an older version might cause the problem. Please make sure you are on the latest version of padr.

ghost commented 4 years ago

Yes, I am on padr version 0.5.1 and I also suspect it's a peculiarity of my data set, which unfortunately I can't publish. Will keep this thread posted on any changes or updates. Thanks