dgrtwo / fuzzyjoin

Join tables together on inexact matching

Naming `distance_col` when matching along multiple variables #84

Open spspitze opened 2 years ago

spspitze commented 2 years ago

I'm experimenting with matching along n variables (e.g., `x1` and `x2`) and want to keep track of the distance for each variable (`distance_col = "distance"`). This works, but the resulting data frame contains n + 1 distance columns: one for each variable, named with the corresponding prefix (e.g., `x1.distance`), plus the original `distance` column, which contains only `NA`s. It would be nice if that all-`NA` column were dropped automatically.

library(tidyverse)
library(fuzzyjoin)

ex_1 <- tibble(
  x1 = c("how", "now", "brown", "cow"),
  x2 = c("what", "do", "I", "know")
)

ex_2 <- tibble(
  x1 = c("hw", "nw", "brwn", "cw"),
  x2 = c("wht", "d", "I", "knw")
)

stringdist_inner_join(ex_1, ex_2, by = c("x1", "x2"),
                      method = "lv",
                      distance_col = "distance")
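
Until something like this is handled inside the package, a possible workaround is to drop the column after the join. The following is only a sketch based on the behavior described above: it assumes the unprefixed `distance` column is present and entirely `NA`, and it leaves the column alone otherwise. The names `joined` and `cleaned` are just illustrative.

```r
library(tidyverse)
library(fuzzyjoin)

ex_1 <- tibble(
  x1 = c("how", "now", "brown", "cow"),
  x2 = c("what", "do", "I", "know")
)

ex_2 <- tibble(
  x1 = c("hw", "nw", "brwn", "cw"),
  x2 = c("wht", "d", "I", "knw")
)

joined <- stringdist_inner_join(ex_1, ex_2, by = c("x1", "x2"),
                                method = "lv",
                                distance_col = "distance")

# Drop the unprefixed distance column only if it exists and holds nothing but NA;
# the per-variable columns (x1.distance, x2.distance) are left untouched.
cleaned <- joined
if ("distance" %in% names(cleaned) && all(is.na(cleaned$distance))) {
  cleaned <- select(cleaned, -distance)
}

names(cleaned)
```

The explicit check guards against removing a `distance` column that carries real values, for example when joining on a single variable.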