Open Deleetdk opened 8 years ago
The general approach I'd use is to join the two using a high (possibly infinite) distance, to use the distance_col
argument to add a distance column, and then to pick the best match in each group (which I'd do with group_by
and top_n
from dplyr, but there are alternatives).
EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")
library(dplyr)
library(fuzzyjoin)
en <- data_frame(EN)
da <- data_frame(DA)
en %>%
stringdist_inner_join(da, by = c(EN = "DA"), max_dist = Inf, distance_col = "distance") %>%
group_by(EN) %>%
top_n(1, -distance)
I do see that this would still allow 1-to-many matches, though. You're looking for a global minimum of distance?
I expanded the knitr to include my (experimental!) implementation of the proposed function:
http://rpubs.com/EmilOWK/209456
It enforces 1-to-1 joining. In some cases, this will result in incorrect joins. E.g. if there are two pairs of cases with wildly different names, they may end up being paired up in the opposite way. For relatively small datasets, this should be rare, but it remains to be tested in the wild.
However, I use another method of dealing with names of political units: translate names to ISO names -- using fuzzy matching if necessary -- then join as normal. This knitr showcases that kind of solution.
Your call above not only violates 1-to-1, it gets the results wrong twice because it matches Germany
with Danmark
and Rusland
, and not with Tyskland
, the correct match.
group_by and top_n works for me! Thanks.
This gist offers a solution with its best_only
parameter:
https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580
This gist offers a solution with its
best_only
parameter:https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580
Nice. It would be useful to make it work with pipes and allow joining on more than one column.
I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr explaining the problem here.
In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.