EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

padr won't work with dplyr `group_by()` unless column names follow base R syntax #69

Closed ghost closed 3 years ago

ghost commented 4 years ago

library(dplyr)
library(padr)

df1 <- data.frame(`col 1` = c("A", "A", "B", "B"),
                  `col 2` = 1:4,
                  `col 3` = as.Date(c("2019-01-01", "2019-01-31",
                                   "2019-02-01", "2019-02-28")))

df2 <- data.frame(col1 = c("A", "A", "B", "B"),
                  col2 = 1:4,
                  col3 = as.Date(c("2019-01-01", "2019-01-31",
                                   "2019-02-01", "2019-02-28")))

I can use the group argument of the pad() function without issue. However, if I use dplyr::group_by() I get the error seen below:

df1 %>% group_by("col 1") %>% pad()
#> Error: Not all grouping variables are column names of x.
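A side note on how `df1` is built: `data.frame()` defaults to `check.names = TRUE`, which passes the names through `make.names()`. As written, `df1` would therefore end up with `col.1`, `col.2` and `col.3` rather than names containing spaces. A minimal base R sketch of the difference (the object names `default_names` and `kept_names` are only illustrative):

```r
# With the default check.names = TRUE, non-syntactic names are repaired.
default_names <- data.frame(`col 1` = 1:2, `col 2` = 3:4)
names(default_names)
#> [1] "col.1" "col.2"

# check.names = FALSE keeps the names exactly as typed.
kept_names <- data.frame(`col 1` = 1:2, `col 2` = 3:4, check.names = FALSE)
names(kept_names)
#> [1] "col 1" "col 2"
```

`tibble()` does not repair names this way, which is why the tibble-based reprex later in the thread is the cleaner test case.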

This only seems to happen when the column names are not syntactically valid in base R. If I "fix" the column names, everything works as expected, as shown below:

df2 %>% group_by(col1) %>% pad()
#> pad applied on the interval: day
#> # A tibble: 59 x 3
#> # Groups:   col1 [2]
#>    col1   col2 col3      
#>    <fct> <int> <date>    
#>  1 A         1 2019-01-01
#>  2 A        NA 2019-01-02
#>  3 A        NA 2019-01-03
#>  4 A        NA 2019-01-04
#>  5 A        NA 2019-01-05
#>  6 A        NA 2019-01-06
#>  7 A        NA 2019-01-07
#>  8 A        NA 2019-01-08
#>  9 A        NA 2019-01-09
#> 10 A        NA 2019-01-10
#> # … with 49 more rows

I suspect this is a bug. I hope this helps.

EdwinTh commented 4 years ago

The column name is not specified correctly in the group_by() call: it uses regular quotes instead of backticks (look at the output of df1 %>% group_by("col 1")).
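To illustrate the point about quotes versus backticks, here is a small sketch using only dplyr. The tibble is a cut-down stand-in for the data in this thread, and `by_string` and `by_name` are illustrative names:

```r
library(dplyr)

df <- tibble(`col 1` = c("A", "A", "B", "B"),
             `col 2` = 1:4)

# A quoted string is evaluated as a constant expression, so dplyr adds a
# new constant column and groups by that, not by the existing `col 1`.
by_string <- df %>% group_by("col 1")
group_vars(by_string)  # the name includes literal quote characters
n_groups(by_string)    # a single group, because the value is constant

# Backticks refer to the existing column.
by_name <- df %>% group_by(`col 1`)
group_vars(by_name)    # "col 1"
n_groups(by_name)      # 2 (A and B)
```

Printing `by_string` shows the extra column, which is the output the comment above suggests inspecting.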

If you want my advice, never use non-syntactic column names that require backticks, because they will cause you all kinds of trouble. Fix them before analysing.
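One possible way to follow this advice is to rename the columns up front. The sketch below replaces spaces with underscores via `dplyr::rename_with()`; the underscore convention and the name `clean` are assumptions made for illustration, not something prescribed in this thread:

```r
library(dplyr)

df <- tibble(`col 1` = c("A", "A", "B", "B"),
             `col 2` = 1:4,
             `col 3` = as.Date(c("2019-01-01", "2019-01-31",
                                 "2019-02-01", "2019-02-28")))

# Replace every space in a column name with an underscore.
clean <- df %>% rename_with(~ gsub(" ", "_", .x, fixed = TRUE))
names(clean)
#> [1] "col_1" "col_2" "col_3"
```

After renaming, the columns can be referenced without backticks, e.g. `clean %>% group_by(col_1)`.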

ghost commented 4 years ago

@EdwinTh even if I create a tibble and wrap the column names in backticks, I still get the same error. At this point the only solution may be to eliminate spaces from column names, as you recommended. Please consider the example below, though: this still seems like an issue even when I use proper backticks.

Create the tibble

library(dplyr)
library(padr)
df <- tibble(`col 1` = c("A", "A", "B", "B"),
             `col 2` = 1:4,
             `col 3` = as.Date(c("2019-01-01", "2019-01-31",
                                 "2019-02-01", "2019-02-28")))
df
#> # A tibble: 4 x 3
#>   `col 1` `col 2` `col 3`   
#>   <chr>     <int> <date>    
#> 1 A             1 2019-01-01
#> 2 A             2 2019-01-31
#> 3 B             3 2019-02-01
#> 4 B             4 2019-02-28

Column names preserved after group_by()

df %>% group_by(`col 1`)
#> # A tibble: 4 x 3
#> # Groups:   col 1 [2]
#>   `col 1` `col 2` `col 3`   
#>   <chr>     <int> <date>    
#> 1 A             1 2019-01-01
#> 2 A             2 2019-01-31
#> 3 B             3 2019-02-01
#> 4 B             4 2019-02-28

Error still produced when pad()ing the tibble

df %>% group_by(`col 1`) %>% pad()
#> Error: Not all grouping variables are column names of x.

Yet everything works when I use padr's group argument (with quoted column names)

df %>% pad(group = "col 1")
#> pad applied on the interval: day
#> # A tibble: 59 x 3
#>    `col 1` `col 2` `col 3`   
#>    <chr>     <int> <date>    
#>  1 A             1 2019-01-01
#>  2 A            NA 2019-01-02
#>  3 A            NA 2019-01-03
#>  4 A            NA 2019-01-04
#>  5 A            NA 2019-01-05
#>  6 A            NA 2019-01-06
#>  7 A            NA 2019-01-07
#>  8 A            NA 2019-01-08
#>  9 A            NA 2019-01-09
#> 10 A            NA 2019-01-10
#> # … with 49 more rows

EdwinTh commented 4 years ago

Ah, that is surprising. I think it has something to do with the character conversion of the groups. I will look into it, thanks.
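Until the cause is pinned down, one possible workaround is to read the grouping columns as plain strings with `dplyr::group_vars()` and hand them to pad()'s `group` argument, which the reprex above shows working with non-syntactic names. The helper `pad_by_groups()` below is hypothetical (it is not part of padr) and is only a sketch under that assumption:

```r
library(dplyr)
library(padr)

df <- tibble(`col 1` = c("A", "A", "B", "B"),
             `col 2` = 1:4,
             `col 3` = as.Date(c("2019-01-01", "2019-01-31",
                                 "2019-02-01", "2019-02-28")))

grouped <- df %>% group_by(`col 1`)

# The grouping metadata itself is intact: group_vars() returns the plain
# column name, and that name is present in the data.
group_vars(grouped)                          # "col 1"
all(group_vars(grouped) %in% names(grouped)) # TRUE

# Hypothetical helper, not part of padr: drop the dplyr grouping, pad via
# the `group` argument, then restore the grouping on the result.
pad_by_groups <- function(x, ...) {
  grp <- group_vars(x)
  if (length(grp) == 0) {
    return(pad(x, ...))
  }
  x %>%
    ungroup() %>%
    pad(group = grp, ...) %>%
    group_by(across(all_of(grp)))
}

padded <- pad_by_groups(grouped)
```

For the example data this should give the same 59 rows as `df %>% pad(group = "col 1")`, with the grouping on `col 1` restored afterwards.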