Open humblecoderrr opened 2 weeks ago
That depends. It is not completely clear what you mean or what you are doing. If I understand correctly, this happens during pairing. Which function are you using? Could you perhaps create and share a small example?
Below is an example using pair_blocking:
> a <- data.frame(x = c(1,1,NA, 2), y = 1:4)
> a
   x y
1  1 1
2  1 2
3 NA 3
4  2 4
> pair_blocking(a, a, on = "x")
First data set: 4 records
Second data set: 4 records
Total number of pairs: 6 pairs
Blocking on: 'x'
      .x    .y
   <int> <int>
1:     3     3
2:     1     1
3:     1     2
4:     2     1
5:     2     2
6:     4     4
This actually surprised me: it generates the pair 3-3, so it does not drop missing values. (I do think it should have generated a warning here).
For pair_minsim, by default, a comparison involving a missing value is treated as not equal. You can see this in the example below, where the similarity score of the pair 3-3 is 1 (only y agrees):
> pair_minsim(a, a, on = c("x", "y"), minsim=1)
First data set: 4 records
Second data set: 4 records
Total number of pairs: 6 pairs
      .x    .y simsum
   <int> <int>  <num>
1:     1     1      2
2:     1     2      1
3:     2     1      1
4:     2     2      2
5:     3     3      1
6:     4     4      2
But the default behaviour of pair_minsim depends on the comparison function used, in this case cmp_identical(). pair_minsim and the functions for probabilistic linkage (problink_em and predict.problink_em) call the comparison function on the comparison result to convert it into logical values. By default this turns an NA into FALSE:
> cmp <- cmp_identical()
> cmp(c(1, NA, 2), c(1, 1, 1))
[1] TRUE NA FALSE
> r1 <- cmp(c(1, NA, 2), c(1, 1, 1))
> r1
[1] TRUE NA FALSE
> r2 <- cmp(r1)
> r2
[1] TRUE FALSE FALSE
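The two-step behaviour shown above can be imitated in base R. The following is only a minimal sketch of the calling convention, not the package's actual implementation; the name cmp_identical_sketch is hypothetical. Called with two arguments it compares the values and keeps NA; called with one argument it converts an earlier comparison result into TRUE/FALSE and maps NA to FALSE.

```r
# Hypothetical stand-in for a comparator; NOT the reclin2 source code.
cmp_identical_sketch <- function() {
  function(x, y) {
    if (!missing(y)) {
      # Step 1: compare two vectors; a missing value yields NA.
      x == y
    } else {
      # Step 2: turn a comparison result into logicals; NA becomes FALSE.
      x & !is.na(x)
    }
  }
}

cmp <- cmp_identical_sketch()
r1 <- cmp(c(1, NA, 2), c(1, 1, 1))  # comparison result, NA preserved
r2 <- cmp(r1)                       # logical result, NA mapped to FALSE
print(r1)
print(r2)
```

The same one-argument/two-argument pattern is what a custom comparator has to follow if it is meant to be passed to the linkage functions.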
I hope this answers your question. If not, please try to generate a little example.
My apologies for not leading with a reproducible example. I have a very large data set (10 million+ records) and would like to link records into pairs. I discovered that I could do this with reclin by setting deduplication=TRUE in the pair_blocking function (example below). It appears that I can calculate the pair scores just fine with missing information. However, the same is not true when I try to calculate the m_probability values. When an NA is present, the m_probability for the pair is returned as NA. My question is: how can I keep the missing values from having this effect on the probabilities?
> a <- data.frame( first = c('James', 'Jacky', 'Hannah', 'James', 'Jacky', 'Hana'),
last = c(NA, 'Smith', 'Daniels', 'Reynolds', 'Smidt', 'Daniels' ),
DOB = c('1994-02-01', '1954-03-02', '1980-12-01', '1994-02-01', '1954-03-02','1980-12-01'),
firstinit = c('J', 'J', 'H', 'J', 'J', 'H'),
sex = c('M', 'F', 'F', 'M', 'F', 'F'),
zip = c('19102', '19989', '19243', '19102', '19101', NA))
> a
   first     last        DOB firstinit sex   zip
1  James     <NA> 1994-02-01         J   M 19102
2  Jacky    Smith 1954-03-02         J   F 19989
3 Hannah  Daniels 1980-12-01         H   F 19243
4  James Reynolds 1994-02-01         J   M 19102
5  Jacky    Smidt 1954-03-02         J   F 19101
6   Hana  Daniels 1980-12-01         H   F  <NA>
> pair_bc <- pair_blocking(a, y=FALSE, on=c("DOB", "firstinit", "sex"),
deduplication = TRUE, add_xy = TRUE)
Warning message:
In pair_blocking(a, y = FALSE, on = c("DOB", "firstinit", "sex"),  :
  y provided will be ignored.
> pair_bc
First data set: 6 records
Second data set: 6 records
Total number of pairs: 3 pairs
Blocking on: 'DOB', 'firstinit', 'sex'
   .x .y
1:  2  5
2:  3  6
3:  1  4
> pairs <- compare_pairs(pair_bc, on = c("first", "last", "DOB", "firstinit", "sex", "zip"),
comparators =list( first= cmp_jarowinkler(.95),
last = cmp_jarowinkler(.85),
DOB = cmp_identical(),
firstinit = cmp_identical(),
sex = cmp_identical(),
zip=cmp_lcs()),
inplace = TRUE)
> pair_scores <- score_simple(pairs , "sim_score", on=c("first", "last", "DOB", "firstinit", "sex", "zip"))
> pair_scores
First data set: 6 records
Second data set: 6 records
Total number of pairs: 3 pairs
Blocking on: 'DOB', 'firstinit', 'sex'
   .x .y     first      last  DOB firstinit  sex zip sim_score
1:  2  5 1.0000000 0.8666667 TRUE      TRUE TRUE 0.4  5.266667
2:  3  6 0.8888889 1.0000000 TRUE      TRUE TRUE  NA  4.888889
3:  1  4 1.0000000        NA TRUE      TRUE TRUE 1.0  5.000000
> m <- problink_em(~ first + last + DOB + firstinit + sex + zip , data=pair_scores )
There were 11 warnings (use warnings() to see them)
> pair_combos <- predict(m, pairs = pair_scores , add = TRUE, type = c( "all"))
> pair_combos
First data set: 6 records
Second data set: 6 records
Total number of pairs: 3 pairs
Blocking on: 'DOB', 'firstinit', 'sex'
   .x .y     first      last  DOB firstinit  sex zip sim_score     mprob        uprob     mpost
1:  2  5 1.0000000 0.8666667 TRUE      TRUE TRUE 0.4  5.266667 0.2206851 0.0005575673 0.9999939
2:  3  6 0.8888889 1.0000000 TRUE      TRUE TRUE  NA  4.888889        NA           NA        NA
3:  1  4 1.0000000        NA TRUE      TRUE TRUE 1.0  5.000000        NA           NA        NA
          upost    weight
1: 6.076612e-06  5.980909
2:           NA  1.322975
3:           NA 14.542525
I am not sure how much detail I need to go into here.
When estimating the model, problink_em assumes that NA corresponds to non-agreement, so 0. problink_em can only work with 0/1 values, so it also truncates the string-distance scores to 0 or 1. The threshold passed to cmp_jarowinkler controls that. How missing values are treated depends on the comparison function, but the ones supplied with the package assume NA = 0.
However, when calculating the probabilities, predict.problink_em works with the numeric values. It scales the probabilities between those for 0 and those for 1. This is what causes the missing values. Perhaps the function should also assume NA = 0 when calculating the probabilities.
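To illustrate why a single NA similarity makes the whole pair probability NA, here is a small base-R sketch. The exact formula used inside predict.problink_em is not shown in this thread, so the linear interpolation below (similarity 0 gives 1 - m, similarity 1 gives m) and the m values are assumptions made purely for illustration.

```r
# Hypothetical per-field m-probabilities (illustrative values only).
m <- c(first = 0.9, last = 0.9, zip = 0.8)

# ASSUMED interpolation between the probability for 0 and the one for 1.
field_prob <- function(sim, m) (1 - m) + (2 * m - 1) * sim

sims_complete <- c(first = 1, last = 0.8666667, zip = 0.4)
sims_missing  <- c(first = 1, last = NA,        zip = 1)

# The pair probability is the product over fields; NA propagates.
mprob_complete <- prod(field_prob(sims_complete, m))
mprob_missing  <- prod(field_prob(sims_missing, m))

# Treating NA as 0 (non-agreement) gives a defined probability again.
sims_zeroed <- sims_missing
sims_zeroed[is.na(sims_zeroed)] <- 0
mprob_zeroed <- prod(field_prob(sims_zeroed, m))

print(c(complete = mprob_complete, missing = mprob_missing, zeroed = mprob_zeroed))
```

Whatever the real scaling is, any arithmetic on an NA similarity yields NA unless the NA is replaced first, which matches the mprob, uprob and mpost columns in the output above.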
When calculating the weights, however, missing values are treated differently. Usually (not in this example: because of the small size it converged to weird estimates), non-agreement on a variable leads to a negative contribution to the total weight and agreement leads to a positive contribution. As NA indicates missing information, it is assumed that this is better modelled with a contribution of 0 to the total weight.
So the weights can be used. Pair selection is usually done using the weights (in theory this is actually optimal; this is what Fellegi and Sunter proved), but mpost is also often used.
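The weight logic described here can be sketched for the simple binary case. This is a simplified illustration of Fellegi-Sunter style weights, not the package's code; the function name fs_weight and the m and u values are hypothetical.

```r
# Per-field weight for a binary comparison:
#   agreement     -> log(m / u)              (positive when m > u)
#   non-agreement -> log((1 - m) / (1 - u))  (negative when m > u)
#   NA            -> 0                       (no information either way)
fs_weight <- function(agree, m, u) {
  w_agree    <- log(m / u)
  w_disagree <- log((1 - m) / (1 - u))
  ifelse(is.na(agree), 0, ifelse(agree, w_agree, w_disagree))
}

# Hypothetical probabilities for one field.
m <- 0.9
u <- 0.1

w <- fs_weight(c(TRUE, FALSE, NA), m, u)
print(w)

# The total weight of a pair is the sum of its per-field weights,
# so a field with NA neither rewards nor penalises the pair.
total <- sum(fs_weight(c(TRUE, NA, TRUE), m, u))
print(total)
```

This is why the weight column stays defined for pairs with a missing field, while the probability columns do not.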
I will think about whether it makes sense to also set NA to 0 when calculating the probabilities (not the weights) in predict.problink_em. Perhaps that is better. In the meantime, if you want to use the probabilities, you can set the missing values to 0 before calculating the predictions. If you are not using the weights, that should not matter.
One way would be to have custom comparator functions that return 0 when NA:
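# Jaro-Winkler comparator that returns 0 instead of NA.
cmp_jarowinkler2 <- function(threshold = 0.95) {
  function(x, y) {
    if (!missing(y)) {
      # Two arguments: compute the similarity between x and y.
      res <- 1 - stringdist::stringdist(x, y, method = "jw")
      res[is.na(res)] <- 0  # treat missing values as non-agreement
      res
    } else {
      # One argument: convert a similarity score to TRUE/FALSE.
      (x > threshold) & !is.na(x)
    }
  }
}

# Longest-common-subsequence comparator that returns 0 instead of NA.
cmp_lcs2 <- function(threshold = 0.80) {
  function(x, y) {
    if (!missing(y)) {
      d <- stringdist::stringdist(x, y, method = "lcs")
      maxd <- nchar(x) + nchar(y)  # largest possible lcs distance
      res <- 1 - d / maxd
      res[is.na(res)] <- 0  # treat missing values as non-agreement
      res
    } else {
      (x > threshold) & !is.na(x)
    }
  }
}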
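An alternative to custom comparators is to overwrite the NA values in the comparison columns after the comparison step and before predicting. The sketch below uses a plain data.frame holding some of the values from the example output above, so it runs without the package; the helper na_to_zero is hypothetical. With a real pairs object, working on a copy would keep the original NA values available if the weights are still needed.

```r
# Stand-in for the compared pairs (values taken from the example above).
pairs_df <- data.frame(
  first = c(1, 0.8888889, 1),
  last  = c(0.8666667, 1, NA),
  DOB   = c(TRUE, TRUE, TRUE),
  zip   = c(0.4, NA, 1)
)

# Hypothetical helper: NA becomes FALSE for logical columns, 0 otherwise.
na_to_zero <- function(x) {
  x[is.na(x)] <- if (is.logical(x)) FALSE else 0
  x
}

# Apply the replacement to every comparison column.
cmp_cols <- c("first", "last", "DOB", "zip")
for (col in cmp_cols) {
  pairs_df[[col]] <- na_to_zero(pairs_df[[col]])
}

print(pairs_df)
```

Note that this changes the meaning of a missing value from "no information" to "non-agreement", so weights computed from the modified columns would penalise those pairs.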
Thank you for taking the time to provide an in-depth response. I had this conversation with my team a few weeks ago to explore replacing NA's with 0 values. They were concerned that it would result in more false positive matches. But I think the lesson here would be to do a thorough evaluation and incorporate weights to classify the pairs. I really appreciate your work on this!
Hi Jan,
Reclin2 has been a life saver. I have to process and link more than 10 million records. However, I recently discovered that the compare_pairs function does not handle missing values well. In fact, a record gets penalized if it has missing values. Several of my records were eliminated during the pairing phase because they had missing information on some of the matching criteria. Is this a correct assessment?