djvanderlaan / reclin2

Record Linkage Toolkit for R
GNU General Public License v3.0

Blocking using approximate nearest neighbours algorithms #22

Open BERENZ opened 10 months ago

BERENZ commented 10 months ago

I am writing to let you know that I have developed a small package called [blocking](https://github.com/ncn-foreigners/blocking) that blocks records using approximate nearest neighbour algorithms (RcppAnnoy, RcppHNSW and mlpack) and graphs (igraph). The package includes the function pair_ann, which is modelled on pair_blocking and pair_minsim so that it integrates directly with your package.

Here is the code using the reclin2 sample data:

library(blocking)
library(reclin2)

data("linkexample1", "linkexample2", package = "reclin2")

linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

# pairing records from linkexample2 to linkexample1 based on the txt column

pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold")

Feel free to test and comment. I plan to submit this package to CRAN in December.

djvanderlaan commented 7 months ago

Hi @BERENZ ,

Thanks for letting me know. This is really nice! I don't see it on CRAN yet. Still working on it? Let me know when it is on CRAN; I will then try to get a reference to your package somewhere in the documentation.

One remark: in pair_ann there is a line:

block_result <- blocking::blocking(x = x[, on], y = if (deduplication) 
        NULL
    else y[, on], deduplication = deduplication, ...)

This fails if x and/or y is already a data.table. I think the easiest fix is to move the data.table::as.data.table lines before this line and use x[, on, with = FALSE]. So, currently:

setDT(linkexample1)
setDT(linkexample2)
pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) 

gives an error.
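A minimal sketch of the subsetting difference behind this, using a toy data frame rather than the reclin2 sample data (the object names df, dt, from_df, from_dt and same_vector are made up for illustration, and this is not the actual pair_ann code):

```r
library(data.table)

# Toy data standing in for linkexample1; only the column name "txt"
# is taken from the example above.
df <- data.frame(id = 1:3, txt = c("anna", "bob", "carl"),
                 stringsAsFactors = FALSE)
on <- "txt"

# data.frame: a character column name in j selects that column
# (a single column is dropped to a plain vector).
from_df <- df[, on]

# data.table: with = FALSE makes j be read as column names,
# which returns a one-column data.table.
dt <- as.data.table(df)
from_dt <- dt[, on, with = FALSE]

# [[ ]] extracts the column as a vector for both classes.
same_vector <- identical(df[[on]], dt[[on]])
```

Note that with = FALSE returns a one-column data.table rather than a vector, so whether that or x[[on]] is the better replacement depends on what blocking::blocking expects for x; I have not checked that.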

Sorry for not getting back earlier.

BERENZ commented 6 months ago

Hi @djvanderlaan,

Thanks for the bug report; as always, I forgot about with = FALSE :) If you have any other comments, please let me know. I have been focusing on other projects, but I think I will be able to submit the package in April.

In addition, we use your package in mecRecordLinkage, an experimental package that implements: Lee, D., Zhang, L-C., and Kim, J.K. (2022). "Maximum entropy classification for record linkage," Survey Methodology, 48, 1-23.

BERENZ commented 4 months ago

Hi @djvanderlaan,

I finally had some time and fixed several issues with the blocking package. Feel free to test it and check whether it works as a useful connection with the reclin2 package.