alysonvanraalte / LifeIneq

An R-package to calculate measures of lifespan variation from a life table
7 stars 6 forks source link

problems with ties in `lx`, and a potential solution. #16

Open timriffe opened 1 year ago

timriffe commented 1 year ago

Ties in lx cause problems for our quantile functions, all of which use monotonic splines. Example

age = 0:10
lx = c(10,9,8,7,7,5,4,3,2,1,0)/10
splinefun(age~lx, method = "monoH.FC")(.5)
# gives warning:
[1] 5
Warning message:
In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values

In other applications that would come up using HMD lx values, warnings will also ensue.

I looked into how other quantile-stuff implementations handle these situations. They use dithering. Here's an lx-specific dither routine that I propose, based on a subset of code found in quantreg::dither(), given that for lx we need to insist on monotonicity. In essence, IFF there are ties, then we generate random uniform deviates between 0 and 10000th of the radix, sort them, add them to lx, then ensure lx has the original radix.

maybe_dither_lx <- function(lx, value = .0001){
   if(any(duplicated(lx))){
     N     <- length(lx)
     radix <- lx[1]
     # random values between 0 and 1000th of radix
     dith  <- runif(N, 0, radix * value) 
     # sort to enforce monotonicity
     dith  <- sort(dith, decreasing = TRUE)
     # apply
     lx    <- lx + dith
     #rescale to same radix
     lx    <- lx * radix / lx[1]
   }
  lx
}
  lx
}

In case there are no ties, then the same lx is spit back, otherwise we do this little perturbation to eliminate ties without compromising the shape. For the above case, we get a reasonable result and no warning:

splinefun(age~maybe_dither_lx(lx), method = "monoH.FC")(.5)
[1] 4.999998
For the more frequent case of consecutive 0s in the highest ages of lx, the behavior is like so (USA bltper_1x1 1936). As is, we get errors, with dithering we don't.

lx = c(100000L, 94002L, 93175L, 92781L, 92517L, 92307L, 92133L, 91976L, 
91834L, 91704L, 91583L, 91467L, 91353L, 91233L, 91103L, 90956L, 
90788L, 90596L, 90381L, 90145L, 89891L, 89621L, 89337L, 89041L, 
88735L, 88420L, 88098L, 87769L, 87434L, 87094L, 86749L, 86402L, 
86046L, 85678L, 85293L, 84883L, 84446L, 83980L, 83486L, 82966L, 
82425L, 81867L, 81291L, 80695L, 80074L, 79422L, 78730L, 77994L, 
77210L, 76375L, 75486L, 74541L, 73538L, 72480L, 71369L, 70212L, 
69013L, 67767L, 66468L, 65106L, 63671L, 62153L, 60547L, 58849L, 
57054L, 55160L, 53162L, 51072L, 48905L, 46673L, 44387L, 42055L, 
39667L, 37207L, 34669L, 32056L, 29476L, 26875L, 24282L, 21735L, 
19275L, 17001L, 14842L, 12805L, 10902L, 9149L, 7561L, 6149L, 
4912L, 3851L, 2962L, 2241L, 1671L, 1233L, 903L, 658L, 458L, 311L, 
207L, 135L, 85L, 53L, 32L, 19L, 11L, 6L, 3L, 2L, 1L, 0L, 0L)

ineq_quantile(0:110,lx)
Error in splinefun(age ~ lx, method = "monoH.FC") : zero non-NA points
In addition: There were 50 or more warnings (use warnings() to see the first 50)

# succeeds
ineq_quantile(0:110,maybe_dither_lx(lx))

In this case, for all but the lowest ages, residuals are negative, observe:

# truncate for default behavior without errors:
q1 <- ineq_quantile(0:108,lx[1:109])
# apply dithering the whole thing
q2 <- ineq_quantile(0:110,maybe_dither_lx(lx,.0001))

resid <- q1 - q2[1:109]
# y scale in weeks
plot(0:108,resid*52)

image

So, trivially negative until the tail. One can simply set a default to a lower value:

q3 <- ineq_quantile(0:110,maybe_dither_lx(lx,.00001))

resid2 <- q1 - q3[1:109]
# y scale in weeks
plot(0:108,resid*52)
lines(0:108,resid2*52)

image

This reduces most age residuals, but then squeezes larger errors to a spike at the "end". No matter what value we choose, warnings and errors are eliminated, but we'd of course like to reduce the consequence. Does something like this seem reasonable to you, like to just do it automatically to remove user headaches? That final "residual" is in the second case is a mere 17.68 weeks of age, or .345 years. Any thoughts on a solution in this direction?

timriffe commented 1 year ago

Further testing adds this nuance: the place to apply something like maybe_dither_lx() is not inside ineq_quantile_lower(), but inside ineq_quantile(), where we pad the top with 0 by default, ergo:

ineq_quantile <- function(age, lx, quantile = .5){
  stopifnot(length(age) == length(lx))
  n    <- length(age)
  lx   <- c(lx, 0)

  # suggested new line
  lx   <- maybe_dither_lx(lx,.0001)

  a    <- c(age, max(age) + 1)
  qs   <- rep(0,n)
  for (i in 1:n){
    qs[i] <- ineq_quantile_lower(age = a[i:(n+1)],
                                 lx = lx[i:(n+1)],
                                 quantile = quantile)
  }
  qs - age
}

We would also need to add it into ineq_iqr() and ineq_cp(), in general this is always something to do on the top level rather than in the innermost function ineq_quantile_lower()