dgrtwo / fuzzyjoin

Join tables together on inexact matching

fuzzy join based on similarity instead of distance #71

Open fangzhou-xie opened 4 years ago

fangzhou-xie commented 4 years ago

Hi! Thanks for this wonderful package.

I am interested in matching two columns by similarity score, and I read in the README that only the stringdist_* family of functions is provided. Is there a way to use join functions based on stringsim?

Thanks a lot!

fangzhou-xie commented 4 years ago

It seems that with method = 'jw', setting max_dist = 0.1 is equivalent to setting a similarity threshold of 0.9. Is a similar shortcut/workaround available for the other distance functions as well?
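For reference, the stringdist documentation describes stringsim() as one minus the distance divided by the largest possible distance for the pair. For 'jw' that maximum is 1, so similarity is simply 1 - distance. For edit distances such as 'lv' the divisor is the length of the longer string, so no single max_dist corresponds to a fixed similarity threshold. A minimal sketch, assuming the stringdist package is installed; the two example strings are arbitrary:

```r
library(stringdist)

x <- "Lorem ipsum dolor sit"
y <- "Lorem ipsum dolor sit amet"

# Jaro-Winkler distance lies in [0, 1], so similarity is 1 - distance
# and max_dist = 0.1 behaves like a similarity threshold of 0.9.
d_jw <- stringdist(x, y, method = "jw")
s_jw <- stringsim(x, y, method = "jw")
all.equal(s_jw, 1 - d_jw)

# Levenshtein distance counts edits; stringsim() rescales it by the
# length of the longer string, so the matching max_dist varies per pair.
d_lv <- stringdist(x, y, method = "lv")
s_lv <- stringsim(x, y, method = "lv")
all.equal(s_lv, 1 - d_lv / max(nchar(x), nchar(y)))
```

In other words, the max_dist shortcut should carry over to the other metrics that are already bounded by 1, while length-dependent metrics need the similarity computed per pair.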

(BTW, the default max_dist = 2 under method = 'jw' seems to always match, presumably because the Jaro-Winkler distance never exceeds 1.)

JBGruber commented 3 years ago

I thought this was a pretty good idea and implemented the function(s). Not sure what @dgrtwo will think of it, but it was good practice. This is how it works:

library(dplyr)
library(fuzzyjoin) # the JBGruber/fuzzyjoin fork, which provides stringsim_left_join()

a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

stringdist::stringsim(a$text[1], b$text[1], method = "soundex")
#> [1] 1

a %>% 
  stringsim_left_join(b, by = "text", similarity_col = "sim", min_sim = 0.8)
#> # A tibble: 1 x 5
#>    id.x text.x                 id.y text.y                       sim
#>   <dbl> <chr>                 <dbl> <chr>                      <dbl>
#> 1     1 Lorem ipsum dolor sit     2 Lorem ipsum dolor sit amet 0.808

You can test it from my repo (remotes::install_github("JBGruber/fuzzyjoin")).
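A similar result may also be reachable without the fork, using the general-purpose fuzzy_left_join() with a custom match_fun. This is only a sketch: sim_at_least() is a hypothetical helper, not part of fuzzyjoin or stringdist, and the "osa" default method is an assumption chosen to mirror the example above. Unlike the fork's similarity_col, it does not add a similarity column.

```r
library(dplyr)
library(fuzzyjoin)

a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

# Hypothetical helper (not part of any package): builds a vectorized
# predicate that is TRUE where the string similarity reaches min_sim.
sim_at_least <- function(min_sim, method = "osa") {
  function(x, y) stringdist::stringsim(x, y, method = method) >= min_sim
}

# fuzzy_left_join() keeps every row of `a` and attaches the rows of `b`
# for which the predicate returns TRUE.
joined <- fuzzy_left_join(a, b, by = "text", match_fun = sim_at_least(0.8))
joined
```

Raising the threshold, for example sim_at_least(0.9), should leave the row from a unmatched, with NA in the columns coming from b.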

fangzhou-xie commented 3 years ago

@JBGruber Thanks a lot! I tried it out a bit and it seems that your implementation works fine. I am not sure what @dgrtwo would think but I personally like it!

Maybe you can try to send a PR and see whether they would like to merge it into the main branch?

JBGruber commented 3 years ago

I already created the PR but haven't got a reply yet: https://github.com/dgrtwo/fuzzyjoin/pull/74