Open maurolepore opened 4 years ago
Very cool.
A word of caution: faster is not always better. By my estimate, the first example in the zoomerjoin docs
returns 1 correct match and 7 incorrect ones, and leaves the rest of the 500 rows in each corpus unmatched. To be fair, it mostly fails on names that differ only by a number, which it likely sees as nearly identical, but to a human those matches are obviously false.
Also to be fair, this is likely no worse than what this package currently does. But it's likely not much better either, even if it's faster.
library(tidyverse)
library(zoomerjoin)
options(width = 130)

corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin
  head(500)
names(corpus_1) <- c("a", "field")
corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin
  tail(500)
names(corpus_2) <- c("b", "field")

jaccard_inner_join(corpus_1, corpus_2,
  by = "field", n_gram_width = 6,
  n_bands = 20, band_width = 6, threshold = .8
)
#> # A tibble: 8 × 4
#>       a field.x                                                      b field.y
#>   <dbl> <chr>                                                    <dbl> <chr>
#> 1   302 americans for good government inc                          910 americans for good government
#> 2   230 pipefitters local union 524                                998 pipefitters local union 533
#> 3   292 bill bradley for u s senate '84                            913 bill bradley for u s senate '90
#> 4   378 guarini for congress 1982                                  606 guarini for congress 1984
#> 5   378 guarini for congress 1982                                  883 guarini for congress 1986
#> 6   238 4th congressional district democratic party                518 16th congressional district democratic party
#> 7    88 scheuer for congress 1980                                  667 scheuer for congress 1984
#> 8   319 7th congressional district democratic party of wisconsin   792 8th congressional district democratic party of wisconsin
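To see why the numeric mismatches slip through, here is a rough base-R sketch of exact Jaccard similarity on character 6-grams. The helpers `shingles()` and `jaccard()` are hypothetical names written for this comment, not zoomerjoin functions, and zoomerjoin approximates this with locality-sensitive hashing, so its internal tokenization and scores may differ.

```r
# Hypothetical helper: the set of distinct character n-grams in a string.
shingles <- function(x, n = 6) {
  starts <- seq_len(max(0, nchar(x) - n + 1))
  unique(substring(x, starts, starts + n - 1))
}

# Hypothetical helper: shared n-grams divided by all distinct n-grams.
jaccard <- function(a, b, n = 6) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

# Two of the false matches from the output above.
sim_year  <- jaccard("guarini for congress 1982", "guarini for congress 1984")
sim_local <- jaccard("pipefitters local union 524", "pipefitters local union 533")

sim_year   # 19/21, about 0.905
sim_local  # 20/24, about 0.833
```

Changing the last digit or two only touches the final one or two 6-grams, so both pairs stay above the `threshold = .8` used in the call, even though a human would reject them at a glance.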
Fair enough.
And code speed isn't really the main blocker with this package; the blocker is the time it takes to manually verify the matches.
Still neat to look into!
https://cran.r-project.org/web/packages/fuzzyjoin/
AB#10180