feat: assess if fuzzyjoin may simplify/enhance the implementation of `match_name`

maurolepore commented 4 years ago

https://cran.r-project.org/web/packages/fuzzyjoin/

AB#10180

jdhoffa commented 4 years ago

Very cool.

cjyetman commented 10 months ago

A word of caution: faster is not always better. By my estimate, the first example in the zoomerjoin docs produces 1 correct match and 7 incorrect ones, and leaves the rest of the 500 rows in each corpus unmatched. To be fair, it mostly fails on names that differ only in a number, a difference it likely gives little weight, but to a human those matches are obviously false.

Also to be fair, this is likely no worse than what this package currently does. But it's likely not much better either, even if it's faster.

library(tidyverse)
library(zoomerjoin)
options(width = 130)

corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin
  head(500)
names(corpus_1) <- c("a", "field")

corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin
  tail(500)
names(corpus_2) <- c("b", "field")

jaccard_inner_join(corpus_1, corpus_2,
  by = "field", n_gram_width = 6,
  n_bands = 20, band_width = 6, threshold = .8
)
#> # A tibble: 8 × 4
#>       a field.x                                                      b field.y                                                 
#>   <dbl> <chr>                                                    <dbl> <chr>                                                   
#> 1   302 americans for good government inc                          910 americans for good government                           
#> 2   230 pipefitters local union 524                                998 pipefitters local union 533                             
#> 3   292 bill bradley for u s senate '84                            913 bill bradley for u s senate '90                         
#> 4   378 guarini for congress 1982                                  606 guarini for congress 1984                               
#> 5   378 guarini for congress 1982                                  883 guarini for congress 1986                               
#> 6   238 4th congressional district democratic party                518 16th congressional district democratic party            
#> 7    88 scheuer for congress 1980                                  667 scheuer for congress 1984                               
#> 8   319 7th congressional district democratic party of wisconsin   792 8th congressional district democratic party of wisconsin
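
To illustrate why names that differ only in a year can clear a 0.8 threshold, here is a minimal base R sketch of exact Jaccard similarity over character 6-grams. The helpers `shingles()` and `jaccard()` are hypothetical names written for this illustration, not part of zoomerjoin. zoomerjoin approximates this similarity with hashing, so the scores it works with internally may differ from the exact value.

```r
# Hypothetical helper: the set of character n-grams (shingles) of one string.
shingles <- function(x, n = 6) {
  if (nchar(x) < n) {
    return(x) # string too short to split; treat it as a single shingle
  }
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

# Hypothetical helper: exact Jaccard similarity between two shingle sets.
jaccard <- function(a, b, n = 6) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

# Row 4 of the output above: only the final 6-gram differs between the two names.
sim_year <- jaccard("guarini for congress 1982", "guarini for congress 1984")
sim_year
```

Each name yields 20 distinct 6-grams, and 19 of them are shared, so the exact similarity is 19/21 (about 0.90). A one-digit change in the year is a small edit by this measure, even though it changes what the name refers to.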

jdhoffa commented 10 months ago

Fair enough.

And code speed isn't really the main blocker with this package; it's the "time it takes to manually verify".
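
One possible way to focus manual verification on the kind of false positives shown above is to flag candidate pairs that become identical once digits are removed. This is only a sketch: `differs_only_in_digits()` and the `review_first` column are hypothetical names, not part of r2dii.match or zoomerjoin, and a flagged pair is not necessarily wrong. It just deserves an earlier look.

```r
# Hypothetical helper: TRUE when two names differ, but become identical
# after all digits are stripped (i.e. they differ only in numbers).
differs_only_in_digits <- function(x, y) {
  strip <- function(s) gsub("[0-9]+", "", s)
  x != y & strip(x) == strip(y)
}

# Three of the pairs from the output above, retyped by hand.
matches <- data.frame(
  field.x = c(
    "americans for good government inc",
    "guarini for congress 1982",
    "4th congressional district democratic party"
  ),
  field.y = c(
    "americans for good government",
    "guarini for congress 1984",
    "16th congressional district democratic party"
  ),
  stringsAsFactors = FALSE
)

matches$review_first <- differs_only_in_digits(matches$field.x, matches$field.y)
matches[, c("field.x", "field.y", "review_first")]
```

Here the first pair is not flagged, because the names differ in the trailing "inc" and not in a number, while the other two pairs are flagged.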

Still neat to look into!

RMI-PACTA / r2dii.match

feat: assess if fuzzyjoin may simplify/enhance the implementation of `match_name` #302