djvanderlaan / reclin2

Record Linkage Toolkit for R
GNU General Public License v3.0

Missing values for matching criteria #29

Open humblecoderrr opened 2 weeks ago

humblecoderrr commented 2 weeks ago

Hi Jan,

Reclin2 has been a life saver. I have to process and link more than 10 million records. However, I recently discovered that the compare_pairs function does not seem to handle missing values well: a record gets penalized if it has missing values. Several of my records were eliminated during the pairing phase because they had missing information on some of the matching criteria. Is this a correct assessment?

djvanderlaan commented 2 weeks ago

That depends. It is not completely clear to me what you mean or what you are doing. If I understand correctly, this happens during pairing. What function are you using? Could you perhaps create and share a small example?

Below is an example using pair_blocking:

> a <- data.frame(x = c(1,1,NA, 2), y = 1:4)
> a
   x y
1  1 1
2  1 2
3 NA 3
4  2 4
> pair_blocking(a, a, on = "x")
  First data set:  4 records
  Second data set: 4 records
  Total number of pairs: 6 pairs
  Blocking on: 'x'

      .x    .y
   <int> <int>
1:     3     3
2:     1     1
3:     1     2
4:     2     1
5:     2     2
6:     4     4

This actually surprised me: it generates the pair 3-3, so it does not drop missing values. (I do think it should have generated a warning here.)
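
Since no warning is given, one option is to check the blocking variables for missing values yourself before pairing. The following is a minimal base R sketch of such a check, not something reclin2 does for you; `blocking_vars`, `n_missing` and `a_complete` are names made up for this illustration.

```r
# Same toy data as in the pair_blocking example above.
a <- data.frame(x = c(1, 1, NA, 2), y = 1:4)

# Hypothetical name: the variables you intend to block on.
blocking_vars <- "x"

# Count missing values per blocking variable.
n_missing <- colSums(is.na(a[blocking_vars]))
if (any(n_missing > 0)) {
  message("Missing values in blocking variables: ",
          paste(names(n_missing)[n_missing > 0], collapse = ", "))
}

# Optionally keep only the records with a complete blocking key.
a_complete <- a[stats::complete.cases(a[blocking_vars]), , drop = FALSE]
```

Whether you then block on `a_complete` or on the full data set depends on whether pairs that agree only on a missing key make sense for your data.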

For pair_minsim, by default, a comparison involving a missing value is considered not equal. You can see this in the example below, where the similarity score of the pair 3-3 is 1 (only y agrees):

> pair_minsim(a, a, on = c("x", "y"), minsim=1)
  First data set:  4 records
  Second data set: 4 records
  Total number of pairs: 6 pairs

      .x    .y simsum
   <int> <int>  <num>
1:     1     1      2
2:     1     2      1
3:     2     1      1
4:     2     2      2
5:     3     3      1
6:     4     4      2

But the default behaviour of pair_minsim depends on the comparison function used, in this case cmp_identical(). pair_minsim and the functions for probabilistic linkage (problink_em and predict.problink_em) will call the comparison function a second time, on its own result, to convert the comparison values into logical values. By default this turns an NA into FALSE:

> cmp <- cmp_identical()
> cmp(c(1, NA, 2), c(1, 1, 1))
[1]  TRUE    NA FALSE
> r1 <- cmp(c(1, NA, 2), c(1, 1, 1))
> r1
[1]  TRUE    NA FALSE
> r2 <- cmp(r1)
> r2
[1]  TRUE FALSE FALSE
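
To make the two-step convention concrete, here is a minimal sketch of a comparison function that behaves like the output above: called with two arguments it compares the values, and called with one argument it converts a comparison result into TRUE/FALSE. `cmp_identical_sketch` is a made-up name and is not the package's own implementation.

```r
# Hypothetical comparator following the two-step convention shown above.
cmp_identical_sketch <- function() {
  function(x, y) {
    if (!missing(y)) {
      # Step 1: compare two vectors; NA in either input gives NA.
      x == y
    } else {
      # Step 2: convert a comparison result to logical; NA becomes FALSE.
      x & !is.na(x)
    }
  }
}

cmp <- cmp_identical_sketch()
r1 <- cmp(c(1, NA, 2), c(1, 1, 1))  # TRUE NA FALSE
r2 <- cmp(r1)                       # TRUE FALSE FALSE
```

Changing the one-argument branch is where you would change how missing values are treated when a logical value is needed.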

I hope this answers your question. If not, please try to generate a little example.

humblecoderrr commented 2 weeks ago

My apologies for not leading with a reproducible example. I have a very large data set (more than 10 million records) and would like to link records into pairs. I discovered that I could do this with reclin2 by setting deduplication = TRUE in the pair_blocking function (example below). It appears that I can calculate the pair scores just fine with missing information. However, the same is not true when I try to calculate the m-probabilities: when an NA is detected, the m-probability for the pair is returned as NA. My question is: how can I keep the missing values from having this effect on the probabilities?

> a <- data.frame( first = c('James', 'Jacky', 'Hannah', 'James', 'Jacky', 'Hana'), 
                  last = c(NA, 'Smith', 'Daniels', 'Reynolds', 'Smidt', 'Daniels' ),
                 DOB = c('1994-02-01', '1954-03-02', '1980-12-01', '1994-02-01', '1954-03-02','1980-12-01'),
                 firstinit = c('J', 'J', 'H', 'J', 'J', 'H'),
                sex = c('M', 'F', 'F', 'M', 'F', 'F'), 
                zip = c('19102', '19989', '19243', '19102', '19101', NA))

> a
   first     last        DOB firstinit sex   zip
1  James     <NA> 1994-02-01         J   M 19102
2  Jacky    Smith 1954-03-02         J   F 19989
3 Hannah  Daniels 1980-12-01         H   F 19243
4  James Reynolds 1994-02-01         J   M 19102
5  Jacky    Smidt 1954-03-02         J   F 19101
6   Hana  Daniels 1980-12-01         H   F  <NA>

> pair_bc <- pair_blocking(a, y=FALSE, on=c("DOB", "firstinit", "sex"), 
                         deduplication = TRUE, add_xy = TRUE)

Warning message:
In pair_blocking(a, y = FALSE, on = c("DOB", "firstinit", "sex"), :
  y provided will be ignored.

> pair_bc
  First data set:  6 records
  Second data set: 6 records
  Total number of pairs: 3 pairs
  Blocking on: 'DOB', 'firstinit', 'sex'

      .x    .y
1:     2     5
2:     3     6
3:     1     4

> pairs <- compare_pairs(pair_bc, on = c("first", "last", "DOB", "firstinit", "sex",  "zip"), 
                        comparators =list( first= cmp_jarowinkler(.95),
                                           last = cmp_jarowinkler(.85),
                                           DOB = cmp_identical(), 
                                           firstinit = cmp_identical(), 
                                           sex = cmp_identical(),
                                          zip=cmp_lcs()),
                        inplace = TRUE)
> pair_scores  <- score_simple(pairs , "sim_score", on=c("first", "last", "DOB", "firstinit", "sex", "zip"))

> pair_scores
  First data set:  6 records
  Second data set: 6 records
  Total number of pairs: 3 pairs
  Blocking on: 'DOB', 'firstinit', 'sex'

      .x    .y     first      last  DOB firstinit  sex zip sim_score
1:     2     5 1.0000000 0.8666667 TRUE      TRUE TRUE 0.4  5.266667
2:     3     6 0.8888889 1.0000000 TRUE      TRUE TRUE  NA  4.888889
3:     1     4 1.0000000        NA TRUE      TRUE TRUE 1.0  5.000000

> m <- problink_em(~ first + last + DOB + firstinit + sex + zip , data=pair_scores )

There were 11 warnings (use warnings() to see them)

> pair_combos <- predict(m, pairs = pair_scores , add = TRUE, type = c( "all"))

> pair_combos
  First data set:  6 records
  Second data set: 6 records
  Total number of pairs: 3 pairs
  Blocking on: 'DOB', 'firstinit', 'sex'

Here is my problem:

      .x    .y     first      last  DOB firstinit  sex zip sim_score     mprob        uprob     mpost
1:     2     5 1.0000000 0.8666667 TRUE      TRUE TRUE 0.4  5.266667 0.2206851 0.0005575673 0.9999939
2:     3     6 0.8888889 1.0000000 TRUE      TRUE TRUE  NA  4.888889        NA           NA        NA
3:     1     4 1.0000000        NA TRUE      TRUE TRUE 1.0  5.000000        NA           NA        NA
          upost    weight
1: 6.076612e-06  5.980909
2:           NA  1.322975
3:           NA 14.542525

djvanderlaan commented 1 week ago

I am not sure how much detail I should go into.

When estimating the model, problink_em assumes that NA corresponds to non-agreement, so 0. problink_em can only work with 0/1 values. Therefore it will also truncate the string-distance scores to 0/1; the threshold passed to cmp_jarowinkler controls that. How missing values are treated depends on the comparison function, but the ones supplied with the package assume NA = 0.
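
As a small illustration of that truncation, the sketch below applies the thresholds from the compare_pairs call earlier in this thread (0.95 for first, 0.85 for last) to the similarity scores reported there. `to_agreement` is a hypothetical helper written for this example; it only mirrors the "above the threshold and not missing" rule and is not a function from the package.

```r
# Similarity scores for "first" and "last" as reported in this thread.
first_sim <- c(1.0000000, 0.8888889, 1.0000000)
last_sim  <- c(0.8666667, 1.0000000, NA)

# Hypothetical helper: TRUE only when the score exceeds the threshold;
# a missing score counts as non-agreement.
to_agreement <- function(sim, threshold) {
  (sim > threshold) & !is.na(sim)
}

first_agree <- to_agreement(first_sim, 0.95)  # TRUE FALSE TRUE
last_agree  <- to_agreement(last_sim, 0.85)   # TRUE TRUE FALSE
```

So for model estimation the pair with the missing last name is treated the same as a pair whose last names clearly differ.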

However, when calculating the probabilities, predict.problink_em works with the numeric values. It scales the probabilities between those for 0 and those for 1. This is what produces the missing values: an NA similarity gives an NA probability. Perhaps the function should also assume NA = 0 when calculating the probabilities.

When calculating the weights, however, the missing values are treated differently. Usually (not in this example, where the small size made the model converge to weird estimates), non-agreement on a variable leads to a negative contribution to the total weight and agreement leads to a positive contribution. As NA indicates missing information, it is assumed that this is better modelled with a contribution of 0 to the total weight.

So the weights can be used. Usually pair selection is done using the weights (in theory this is actually optimal; this is what Fellegi and Sunter proved). But mpost is also often used.
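
To illustrate the idea, here is a sketch of the textbook Fellegi-Sunter weight for yes/no comparisons, with a missing comparison contributing 0. The m- and u-probabilities are made-up numbers, `fs_weight` is a hypothetical helper, and this is not the exact calculation in predict.problink_em, which also handles numeric similarity scores.

```r
# Hypothetical m- and u-probabilities for three comparison variables.
m <- c(first = 0.95, last = 0.90, zip = 0.85)
u <- c(first = 0.05, last = 0.02, zip = 0.10)

# Total weight of one pair: log(m/u) for agreement,
# log((1-m)/(1-u)) for non-agreement, and 0 for a missing comparison.
fs_weight <- function(agree, m, u) {
  w_agree    <- log(m / u)
  w_disagree <- log((1 - m) / (1 - u))
  w <- ifelse(agree, w_agree, w_disagree)
  w[is.na(agree)] <- 0
  sum(w)
}

w_all_agree <- fs_weight(c(first = TRUE, last = TRUE,  zip = TRUE), m, u)
w_last_na   <- fs_weight(c(first = TRUE, last = NA,    zip = TRUE), m, u)
w_last_diff <- fs_weight(c(first = TRUE, last = FALSE, zip = TRUE), m, u)
```

With these numbers the pair with a missing last name ends up between full agreement and explicit disagreement, which is the behaviour described above.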

I will think about whether it makes sense to also set NA to 0 when calculating the probabilities (not the weights) in predict.problink_em. Perhaps that is better. In the meantime, if you want to use the probabilities, you can set the missing values to 0 before calculating the predictions. If you are not using the weights, that should not matter.

One way is to use custom comparison functions that return 0 instead of NA:
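
A direct way to do the replacement is sketched below on a plain data.frame that copies the comparison values from this thread. `na_to_zero`, `pairs_df` and `pairs_for_prob` are names made up for the illustration. The real pairs object is a data.table, so there you may prefer data.table's own update syntax; that part is an assumption and is not shown here.

```r
# Stand-in for the pairs object, with the values shown earlier in this thread.
pairs_df <- data.frame(
  .x    = c(2, 3, 1),
  .y    = c(5, 6, 4),
  first = c(1.0000000, 0.8888889, 1.0000000),
  last  = c(0.8666667, 1.0000000, NA),
  zip   = c(0.4, NA, 1.0)
)

# Hypothetical helper: replace NA by 0 in the given comparison columns
# and return a modified copy, leaving the input untouched.
na_to_zero <- function(d, vars) {
  for (v in vars) {
    d[[v]][is.na(d[[v]])] <- 0
  }
  d
}

pairs_for_prob <- na_to_zero(pairs_df, c("last", "zip"))
```

Working on a copy keeps the original NA values available in case you still want weights in which a missing comparison contributes 0.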

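# Variant of cmp_jarowinkler that returns 0 instead of NA.
cmp_jarowinkler2 <- function(threshold = 0.95) {
  function(x, y) {
    if (!missing(y)) {
      # Two arguments: similarity between x and y; missing becomes 0.
      res <- 1-stringdist::stringdist(x, y, method = "jw")
      res[is.na(res)] <- 0
      res
    } else {
      # One argument: convert a similarity score to TRUE/FALSE.
      (x > threshold) & !is.na(x)
    }
  }
}

# Variant of cmp_lcs that returns 0 instead of NA.
cmp_lcs2 <- function(threshold = 0.80) {
  function(x, y) {
    if (!missing(y)) {
      # Two arguments: LCS distance scaled by its maximum; missing becomes 0.
      d <- stringdist::stringdist(x, y, method = "lcs")
      maxd <- nchar(x) + nchar(y)
      res <- 1 - d/maxd
      res[is.na(res)] <- 0
      res
    } else {
      # One argument: convert a similarity score to TRUE/FALSE.
      (x > threshold) & !is.na(x)
    }
  }
}
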
humblecoderrr commented 1 week ago

Thank you for taking the time to provide an in-depth response. A few weeks ago I discussed replacing NAs with 0 values with my team. They were concerned that it would result in more false positive matches. But I think the lesson here is to do a thorough evaluation and to use the weights to classify the pairs. I really appreciate your work on this!