dgrtwo / fuzzyjoin

Join tables together on inexact matching

Error: All columns in a tibble must be vectors. x Column `col` is NULL. #78

Open marcelbaumgartner opened 3 years ago

marcelbaumgartner commented 3 years ago

Hello,

Your package works fine, but with some recent data I am again getting this error, which I thought had already been fixed. I cannot share the two tibbles I use, for confidentiality reasons. What could cause this error? When I backtrace the error, I see this:

Backtrace:
     x
  1. +-fuzzyjoin::fuzzy_left_join(...)
  2. | \-fuzzyjoin::fuzzy_join(x, y, by, match_fun, mode = "left", ...)
  3. |   +-dplyr::bind_rows(...)
  4. |   | \-rlang::list2(...)
  5. |   \-base::lapply(...)
  6. |     \-fuzzyjoin:::FUN(X[[i]], ...)
  7. |       \-`%>%`(...)
  8. +-dplyr::mutate(., indices = purrr::map(data, "indices"))
  9. +-tidyr::nest(.)
 10. +-dplyr::group_by(., col)
 11. \-dplyr:::group_by.data.frame(., col)
 12.   \-dplyr::grouped_df(groups$data, groups$group_names, .drop)
 13.     \-dplyr:::compute_groups(data, vars, drop = drop)
 14.       +-tibble::as_tibble(data)
 15.       \-tibble:::as_tibble.data.frame(data)
 16.         \-tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
 17.           \-tibble:::check_valid_cols(x)

The code looks like this:

FBL5N <- fuzzy_left_join(
  x = FBL5N,
  y = V_LD_for_join,
  by = c("ACCOUNT" = "Customer",
         "NGA_PSTNG_DATE" = "NGA_VALID_FROM",
         "NGA_PSTNG_DATE" = "NGA_VALID_TO"),
  match_fun = list(`==`, `>=`, `<=`))

Any idea what could be wrong?

Sessioninfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] tibble_3.1.0     stringi_1.5.3    tidyr_1.1.3      stringr_1.4.0   
[5] lubridate_1.7.10 dplyr_1.0.5      fuzzyjoin_0.1.6 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        rstudioapi_0.13   magrittr_2.0.1   
 [4] tidyselect_1.1.0  R6_2.5.0          rlang_0.4.10     
 [7] fansi_0.4.2       tools_4.0.2       data.table_1.14.0
[10] utf8_1.1.4        cli_2.3.1         DBI_1.1.1        
[13] ellipsis_0.3.1    assertthat_0.2.1  lifecycle_1.0.0  
[16] crayon_1.4.1      purrr_0.3.4       vctrs_0.3.6      
[19] glue_1.4.2        compiler_4.0.2    pillar_1.5.1     
[22] generics_0.1.0    pkgconfig_2.0.3  

dgrtwo commented 3 years ago

Thanks for your report!

I'm afraid I haven't seen this error before, and I can't make any progress here without a reproducible example. Would you be able to see if you can reproduce it with something like this, that takes only two rows and a handful of columns?

# Assign to a new name (test_join is arbitrary) so the full FBL5N table is not overwritten
test_join <- fuzzy_left_join(
  x = FBL5N %>% head(2) %>% select(ACCOUNT, NGA_PSTNG_DATE),
  y = V_LD_for_join %>% head(2) %>% select(Customer, NGA_VALID_FROM, NGA_VALID_TO),
  by = c("ACCOUNT" = "Customer",
         "NGA_PSTNG_DATE" = "NGA_VALID_FROM",
         "NGA_PSTNG_DATE" = "NGA_VALID_TO"),
  match_fun = list(`==`, `>=`, `<=`))

If that does reproduce the error, then perhaps you can take just those selected versions of the tables, anonymize them enough (e.g. by changing the Customer name), and then post them as a reproducible example?
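One way the anonymizing step could look, as a rough sketch in base R. The table `V_LD_small` and all of its values are invented stand-ins for the confidential data; only the column names are taken from the code above.

```r
# Invented stand-in for the two-row, three-column subset of the confidential table
V_LD_small <- data.frame(
  Customer       = c("Acme Ltd", "Globex AG"),
  NGA_VALID_FROM = as.Date(c("2021-01-01", "2021-06-01")),
  NGA_VALID_TO   = as.Date(c("2021-05-31", "2021-12-31")),
  stringsAsFactors = FALSE
)

# Replace each distinct customer name with a neutral label.
# match() against unique() keeps identical names identical after relabeling.
V_LD_small$Customer <- paste0(
  "cust_",
  match(V_LD_small$Customer, unique(V_LD_small$Customer))
)

# dput() prints R code that recreates the object, ready to paste into an issue
dput(V_LD_small)
```

If the join key is relabeled in both tables, the same mapping should be applied to each so that matching rows still match.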

marcelbaumgartner commented 3 years ago

I am so sorry, I made a terrible mistake: the lookup column is called "CUSTOMER", not "Customer". Now everything works perfectly. A small suggestion: the error message is very cryptic when you make this mistake. Maybe you could add something like "the column Customer does not exist in table y". But really, this is all on me. Thanks for your prompt response!
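Until the package reports this more clearly, a check along the following lines could be run before the join. This is only a sketch: `check_by_columns` is a hypothetical helper, not part of fuzzyjoin, and the two small data frames are invented to mimic the "Customer" versus "CUSTOMER" mismatch described above.

```r
# Hypothetical helper (not a fuzzyjoin function): verify that every column
# named in `by` exists in the table it refers to.
check_by_columns <- function(x, y, by) {
  # In c("a" = "b"), the name is the x column and the value is the y column.
  x_cols <- if (is.null(names(by))) by else names(by)
  y_cols <- unname(by)
  # An element without a name refers to the same column in both tables
  x_cols[x_cols == ""] <- y_cols[x_cols == ""]

  missing_x <- setdiff(x_cols, colnames(x))
  missing_y <- setdiff(y_cols, colnames(y))
  problems <- c(
    if (length(missing_x) > 0) paste0("not found in x: ", paste(missing_x, collapse = ", ")),
    if (length(missing_y) > 0) paste0("not found in y: ", paste(missing_y, collapse = ", "))
  )
  if (length(problems) > 0) stop(paste(problems, collapse = "; "), call. = FALSE)
  invisible(TRUE)
}

# Invented data: y uses "CUSTOMER", while `by` asks for "Customer"
x <- data.frame(ACCOUNT = "A1", NGA_PSTNG_DATE = as.Date("2021-01-15"))
y <- data.frame(CUSTOMER = "A1",
                NGA_VALID_FROM = as.Date("2021-01-01"),
                NGA_VALID_TO = as.Date("2021-12-31"))
by <- c("ACCOUNT" = "Customer",
        "NGA_PSTNG_DATE" = "NGA_VALID_FROM",
        "NGA_PSTNG_DATE" = "NGA_VALID_TO")

# Capture the error message instead of stopping the script
msg <- tryCatch(check_by_columns(x, y, by),
                error = function(e) conditionMessage(e))
print(msg)
```

Column names in R are case-sensitive, so a quick look at `colnames(y)` would also reveal the mismatch.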