dgrtwo / fuzzyjoin

Join tables together on inexact matching

Errors when trying to fuzzyjoin #35

Open jessiejyen opened 7 years ago

jessiejyen commented 7 years ago

Hello! Please excuse any obvious or dumb errors - I am new to R and tried looking up the answers beforehand, I promise!

I am trying to join two tables using the R "fuzzyjoin" package. To be exact, I want to fuzzy join two tables of company names. I have one data frame of 5000 company names and one data frame of 1600 company names. There are no other columns besides the company names.

Using the package, I have:

NewTable <- AccountsList1 %>% stringdist_inner_join(AccountsList2, by = NULL)

However, I got two errors:

Joining by: "Accounts"
Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], : NAs are not allowed in subscripted assignments

and

50: In stri_length(string) : invalid UTF-8 byte sequence detected. perhaps you should try calling stri_enc_toutf8()

So then I removed the NAs via [!is.na(AccountsList1)] and forced UTF-8 via stri_enc_toutf8(AccountsList1, is_unknown_8bit = FALSE, validate = FALSE)

However, when I rerun I get the exact same errors... Does anyone have any advice? Thank you!

david-jankoski commented 7 years ago

If it is of any help: the first error seems to stem from the stringdist package and the second from the stringi package, so this is probably not an issue with fuzzyjoin itself.
I see you tried to clean up AccountsList1, but the problem might still be in AccountsList2. Have you tried the same there?
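
A minimal sketch of cleaning the join column in both tables, using base R only. The sample data, the helper name clean_accounts, and the assumption that the bad bytes are Latin-1 are all hypothetical; only the column name "Accounts" comes from the "Joining by" message above. One possible reason the earlier attempt had no effect is that the cleaned result was applied to the whole data frame or never assigned back, but that is a guess.

```r
# Hypothetical stand-ins for the two single-column tables in the question.
AccountsList1 <- data.frame(Accounts = c("Acme Corp", NA, "Caf\xe9 Ltd"),
                            stringsAsFactors = FALSE)
AccountsList2 <- data.frame(Accounts = c("ACME Corporation", "Cafe Limited", NA),
                            stringsAsFactors = FALSE)

# Hypothetical helper: convert invalid strings to UTF-8 (assuming the source
# encoding is Latin-1), then drop rows whose join column is NA.
clean_accounts <- function(df, col = "Accounts", from = "latin1") {
  x <- as.character(df[[col]])
  bad <- !is.na(x) & !validUTF8(x)        # strings that are not valid UTF-8
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  df[[col]] <- x                          # write the cleaned column back
  df[!is.na(df[[col]]), , drop = FALSE]   # keep only rows without NA
}

# Clean BOTH inputs and assign the results back before joining.
AccountsList1 <- clean_accounts(AccountsList1)
AccountsList2 <- clean_accounts(AccountsList2)

# The join could then be retried, for example:
# NewTable <- fuzzyjoin::stringdist_inner_join(AccountsList1, AccountsList2, by = "Accounts")
```

If the real data is in some encoding other than Latin-1, the from argument would need to change accordingly.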

TiagoVentura commented 6 years ago

Hi,

I am getting the same error, but I do not have any NA values in the two tibbles I am using as inputs to the function. Did you solve it?

david-jankoski commented 6 years ago

I can try to help figure out what's going on if someone could post a minimal reproducible example to work with.

statsccpr commented 6 years ago

I hit this error today.

I'm pretty sure it's caused by outdated dependency packages.

I tried it on two platforms.

This works (did not hit error)

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics 
[3] grDevices utils    
[5] datasets  methods  
[7] base     

other attached packages:
[1] fuzzyjoin_0.1.3

loaded via a namespace (and not attached):
 [1] compiler_3.4.1  
 [2] magrittr_1.5    
 [3] assertthat_0.2.0
 [4] R6_2.2.2        
 [5] tools_3.4.1     
 [6] bindrcpp_0.2    
 [7] glue_1.1.1      
 [8] dplyr_0.7.4     
 [9] tibble_1.3.3    
[10] yaml_2.1.16     
[11] Rcpp_0.12.14    
[12] pkgconfig_2.0.1 
[13] rlang_0.1.4     
[14] bindr_0.1  

This does not work (hits error)

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fuzzyjoin_0.1.4 here_0.1        dplyr_0.7.4     datzen_0.1.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     rprojroot_1.2    assertthat_0.2.0 R6_2.2.2         backports_1.1.1 
 [6] magrittr_1.5     rlang_0.1.2      bindrcpp_0.2     tools_3.2.3      glue_1.1.1      
[11] rsconnect_0.8    pkgconfig_2.0.1  bindr_0.1        tibble_1.3.4    

prokopyev commented 6 years ago

I am hitting the same error on large lists (small examples work fine):

Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], : NAs are not allowed in subscripted assignments

[UPDATE]

Some exploration led to the discovery of NAs in one of the arguments passed to stringdist_inner_join. They can be located with which(is.na(v2)) and replaced with df_cons[is.na(v2)] <- "empty_string".

After this, the error disappears.
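
The locate-and-replace step described above could look roughly like this. The vector v2 here is made-up sample data, and dropping the NAs instead of filling them is an alternative not mentioned in the comment; it avoids the placeholder string taking part in fuzzy matches.

```r
# Hypothetical join column containing missing values.
v2 <- c("Acme Corp", NA, "Globex", NA)

# Positions of the missing values.
na_idx <- which(is.na(v2))

# Option 1: replace the NAs with a placeholder, as described above.
v2_filled <- v2
v2_filled[na_idx] <- "empty_string"

# Option 2 (an alternative, not from the comment): drop the NAs entirely.
v2_dropped <- v2[!is.na(v2)]
```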

vikrantkakad commented 6 years ago

@statsccpr I also faced a similar issue with one large dataset: it works for me on a Windows PC but did not work on Mac OS X. I tried two different computers (Linux and OS X) with the exact same R, RStudio, and package versions.

Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], :
NAs are not allowed in subscripted assignments

Later I changed the encoding of the Excel file I was using to UTF-8, reloaded it into an R data frame, and it worked this time. It seems to be an OS encoding issue: Windows and OS X/Linux treat Excel/CSV files differently.
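
As an alternative to re-saving the file, the conversion can also be attempted on the R side. The sketch below is only an illustration under several assumptions: the temporary file stands in for a single-column export, the source encoding is assumed to be Latin-1, and the column name "Accounts" is taken from the first post.

```r
# Write a small Latin-1 encoded file to a temporary path (a hypothetical
# stand-in for a CSV exported on Windows). Byte 0xe9 is "e acute" in Latin-1.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Accounts\nCaf"), as.raw(0xe9), charToRaw(" Ltd\n")), path)

# Read the lines as-is; the non-ASCII line is not valid UTF-8 at this point.
lines <- readLines(path, warn = FALSE)
valid_before <- validUTF8(lines)

# Convert from the assumed source encoding to UTF-8.
lines <- iconv(lines, from = "latin1", to = "UTF-8")

# Build the data frame from the converted lines (first line is the header).
accounts <- data.frame(Accounts = lines[-1], stringsAsFactors = FALSE)

# Every value should now be valid UTF-8 before attempting the join.
valid_after <- all(validUTF8(accounts$Accounts))
```

For multi-column files, read.csv(path, fileEncoding = "latin1") is the more usual route, though its result depends on the session locale.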