beniaminogreen / zoomerjoin

Superlatively-fast fuzzy-joins in R
https://beniamino.org/zoomerjoin/
GNU General Public License v3.0

Support `join_by()` #93

Closed by etiennebacher 10 months ago

etiennebacher commented 10 months ago

Is your feature request related to a problem? Please describe.

The `join_by()` helper function was introduced in dplyr 1.1.0. Since the functions in zoomerjoin are designed to be drop-in replacements for dplyr's join functions, it would be nice to support this syntax so that users don't have to rewrite the join specification as a named vector.

Describe the solution you'd like

Support for the `join_by()` syntax, so that the example below works:

library(babynames)
library(zoomerjoin)
library(dplyr, warn.conflicts = FALSE)

baby_names <- data.frame(name = tolower(unique(babynames$name)))
baby_names_sans_vowels <- data.frame(
  name_wo_vowels = gsub("[aeiouy]", "", baby_names$name)
)

# dplyr
joined_names <- inner_join(
  baby_names,
  baby_names_sans_vowels,
  by = join_by(name == name_wo_vowels)
)

# zoomerjoin
joined_names <- jaccard_inner_join(
  baby_names,
  baby_names_sans_vowels,
  by = join_by(name == name_wo_vowels)
)
#> Warning in jaccard_join(a, b, mode = "inner", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 93% chance of being compared.
#> Please consider changing `n_bands` and `band_width`.
#> Error in simple_by_validate(a, b, by): by_a %in% names(a) are not all TRUE

Describe alternatives you've considered

This is a minor feature request, only for convenience. Using a named vector works very well; it has simply been superseded by the `join_by()` syntax.
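For readers who want a stopgap, here is a minimal sketch of converting an equality-only `join_by()` specification into the named vector that already works. The helper `join_by_to_named_vector()` is hypothetical and is not part of zoomerjoin or dplyr. It assumes the object returned by `join_by()` has class `dplyr_join_by` and exposes `x`, `y`, and `condition` fields, which is an implementation detail of dplyr (>= 1.1.0) that could change. The toy data frames are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of zoomerjoin or dplyr): turn an
# equality-only join_by() spec into the named vector c(x_col = "y_col").
# Assumption: the spec exposes `x`, `y`, and `condition` fields.
join_by_to_named_vector <- function(by) {
  stopifnot(inherits(by, "dplyr_join_by"))
  if (!all(by$condition == "==")) {
    stop("Only equality conditions can be expressed as a named vector.")
  }
  stats::setNames(by$y, by$x)
}

by_vec <- join_by_to_named_vector(join_by(name == name_wo_vowels))

# Toy data, invented for this sketch
a <- data.frame(name = c("bn", "anna", "mrk"))
b <- data.frame(name_wo_vowels = c("bn", "mrk", "nn"))

# With dplyr's exact join, both specifications give the same result
joined_vec <- inner_join(a, b, by = by_vec)
joined_jb <- inner_join(a, b, by = join_by(name == name_wo_vowels))
```

The resulting `by_vec` is an ordinary named character vector, so it could presumably be passed as `by` to `jaccard_inner_join()` in the same way as a hand-written one.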

Additional context /

Thanks for this amazing package! It's so nice to be able to match names that fast.

beniaminogreen commented 10 months ago

Thanks for flagging this! I was not aware of this dplyr feature, and I'll add it to my list of objectives for the package.

Best, Ben