dgrtwo / fuzzyjoin

Join tables together on inexact matching

In the newest version of fuzzyjoin, when joining data.tables, they lose the data.table attribute #75

Open emilBeBri opened 3 years ago

emilBeBri commented 3 years ago

In the new version of fuzzyjoin, joining data.tables makes them stop being data.tables.

I just updated to R 4.*, and a lot of packages were updated along with it. I did a complete reinstall of my OS, so I don't know which versions I had before. With the new versions, joining data.tables with fuzzyjoin suddenly became a problem whenever later code relied on data.table syntax.

reprex:

library(data.table)
library(fuzzyjoin)
a1 <- data.table(name=c('suzy', 'suxy', 'John', 'Janni', 'Tom'))
b1 <- data.table(name=c('suzzy', 'johnn', 'Jannice', 'Tom'))
c1 <- stringdist_inner_join(a1, b1, by = 'name', method='lv', max_dist=1, ignore_case=TRUE, distance_col='fuzzy_dist')
is.data.table(c1)

You can easily restore the data.table class with:

setDT(c1)
is.data.table(c1)

So it's easy to fix, but it broke some matching functions I had written that relied on data.table syntax after stringdist_inner_join() was applied.
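In case it helps anyone hitting the same thing, here is a minimal sketch of a wrapper that applies the setDT() fix automatically. The name join_keep_dt is hypothetical and is not part of fuzzyjoin or data.table. It only assumes that the join function takes the two tables as its first two arguments.

```r
library(data.table)

# Hypothetical helper (not part of fuzzyjoin or data.table):
# run any join function and, if the left input was a data.table,
# convert the result back to a data.table.
join_keep_dt <- function(join_fun, x, y, ...) {
  res <- join_fun(x, y, ...)
  if (is.data.table(x) && !is.data.table(res)) {
    setDT(res)  # converts in place, by reference, without copying the data
  }
  res
}

# Usage with the reprex above (assumes fuzzyjoin is installed):
# c1 <- join_keep_dt(fuzzyjoin::stringdist_inner_join, a1, b1,
#                    by = 'name', method = 'lv', max_dist = 1,
#                    ignore_case = TRUE, distance_col = 'fuzzy_dist')
# is.data.table(c1)
```

The wrapper leaves the result untouched when the left input is not a data.table, so it should be safe to use in code that also handles plain data.frames or tibbles.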

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fuzzyjoin_0.1.6   data.table_1.13.2

loaded via a namespace (and not attached):
 [1] stringdist_0.9.6.3 tidyr_1.1.2        crayon_1.3.4.9000  dplyr_1.0.2       
 [5] R6_2.4.1           lifecycle_0.2.0    magrittr_1.5       pillar_1.4.6      
 [9] stringi_1.5.3      rlang_0.4.8        vctrs_0.3.4        generics_0.0.2    
[13] ellipsis_0.3.1     tools_4.0.3        stringr_1.4.0      glue_1.4.2        
[17] purrr_0.3.4        parallel_4.0.3     compiler_4.0.3     pkgconfig_2.0.3   
[21] tidyselect_1.1.0   tibble_3.0.4