Open BERENZ opened 10 months ago
Hi @BERENZ,
Thanks for letting me know. This is really nice! I don't see it on CRAN yet. Are you still working on it? Let me know when it is on CRAN; I will then try to get a reference to your package somewhere in the documentation.
One remark: in `pair_ann` there is a line:
block_result <- blocking::blocking(x = x[, on], y = if (deduplication)
NULL
else y[, on], deduplication = deduplication, ...)
this fails if `x` and/or `y` are already a data.table. I think this is easiest to solve by placing the `data.table::as.data.table` lines before this line and using `x[, on, with = FALSE]`. So, currently:
setDT(linkexample1)
setDT(linkexample2)
pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE)
gives an error.
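To illustrate the suggested fix, here is a minimal sketch in R. The helper `select_on` and the toy data are hypothetical and are not part of `blocking` or `reclin2`; the sketch only assumes that `data.table` is installed. It converts the input with `data.table::as.data.table` first and then selects the columns with `with = FALSE`, so the same code path should work whether the caller passes a data.frame or a data.table.

```r
library(data.table)

# Hypothetical helper (not from blocking or reclin2): select the columns
# named in `on`, accepting either a data.frame or a data.table as input.
select_on <- function(df, on) {
  dt <- data.table::as.data.table(df)  # normalise the input type first
  dt[, on, with = FALSE]               # treat `on` as a vector of column names
}

# Toy data standing in for the reclin2 example data (assumption).
df <- data.frame(id = 1:3, txt = c("anna", "jan", "eva"),
                 stringsAsFactors = FALSE)
dt <- data.table::as.data.table(df)

from_df <- select_on(df, "txt")  # data.frame input
from_dt <- select_on(dt, "txt")  # data.table input
```

Both calls return a one-column data.table, which is the shape `x[, on]` would give for a plain data.frame with more than one column in `on`.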
Sorry for not getting back earlier.
Hi @djvanderlaan,
Thanks for the bug report; as always, I forgot about `with = FALSE` :) If you have any other comments, please let me know. I have been focused on other projects, but I think I will be able to submit the package in April.
In addition, we use your package in mecRecordLinkage, an experimental package that implements: Lee, D., Zhang, L-C., and Kim, J.K. (2022). "Maximum entropy classification for record linkage," Survey Methodology, 48, 1-23.
Hi @djvanderlaan,
I finally had some time and fixed several issues with the `blocking` package. Feel free to test it and verify whether it is useful in connection with the `reclin2` package.
I am writing to let you know that I have developed a small package called [blocking](https://github.com/ncn-foreigners/blocking) that allows blocking of records based on approximate nearest neighbours algorithms (`RcppAnnoy`, `RcppHNSW` and `mlpack`) and graphs (`igraph`). The package includes the function `pair_ann`, which was developed on the basis of `pair_blocking` and `pair_minsim` to allow direct integration into your package.

Here is the code using the `reclin2` sample data:

Feel free to test and comment. I plan to submit this package to CRAN in December.