dgrtwo / fuzzyjoin

Join tables together on inexact matching
Other
670 stars 61 forks source link

Optimal joining? #18

Open Deleetdk opened 8 years ago

Deleetdk commented 8 years ago

I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr explaining the problem here.

In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.

dgrtwo commented 8 years ago

The general approach I'd use is to join the two using a high (possibly infinite) distance, to use the distance_col argument to add a distance column, and then to pick the best match in each group (which I'd do with group_by and top_n from dplyr, but there are alternatives).

EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")

library(dplyr)
library(fuzzyjoin)

en <- data_frame(EN)
da <- data_frame(DA)

en %>%
  stringdist_inner_join(da, by = c(EN = "DA"), max_dist = Inf, distance_col = "distance") %>%
  group_by(EN) %>%
  top_n(1, -distance)

I do see that this would still allow 1-to-many matches, though. You're looking for a global minimum of distance?

Deleetdk commented 8 years ago

I expanded the knitr to include my (experimental!) implementation of the proposed function:

http://rpubs.com/EmilOWK/209456

It enforces 1-to-1 joining. In some cases, this will result in incorrect joins. E.g. if there are two pairs of cases with wildly different names, they may end up being paired up in the opposite way. For relatively small datasets, this should be rare, but it remains to be tested in the wild.

However, I use another method of dealing with names of political units: translate names to ISO names -- using fuzzy matching if necessary -- then join as normal. This knitr showcases that kind of solution.

Your call above not only violates 1-to-1, it gets the results wrong twice because it matches Germany with Danmark and Rusland, and not with Tyskland, the correct match.

ahcyip commented 7 years ago

group_by and top_n works for me! Thanks.

prokopyev commented 6 years ago

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

moh-salah commented 5 years ago

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

Nice. It would be useful to make it work with pipes and allow joining on more than one column.