Discussion of size of cure fraction and utility of split approach

(5) Because the paper does not include any simulation results, would a short description of the performance of the split population model for different ‘cure’ rates enhance users’ confidence in using this package to re-analyze their data?

@dhill138 any way we can answer this without having to end up doing simulations?

@andybega I don’t think so. I found one paper that examines performance under different cure rates, but it’s for a binomial-exponential mixture model with no covariates. I do have very old code for a Weibull simulation with time-invariant covariates. Revamping it and coding up a log-logistic version seems like too much to me. I’m happy to punt on this one. Any other thoughts?

Where's that paper? I'm happy to punt too.

I guess, generally, at the edge cases:

- no cured in the unobservable world: all cases are at risk, although we wouldn't code any censored cases as such. Depending on how the logistic part works, we might classify those falsely assumed cured spells as being at risk after all. I guess compared to a regular non-mixture duration model the split-pop model would underestimate the hazard because it assumes some censored cases are not at risk.
- all cured in the unobservable world: this implies no failures, so it's kind of trivial. We have no data to model with.
- however, mostly cured, or at least lots of cured: a regular duration model would underestimate the hazard because it averages out over cured spells. Our split-population model would have less underestimation, depending on how well the split part works.
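The "lots of cured" case above can be illustrated with a minimal sketch. It is not a result from the paper and does not use the package: it assumes an exponential hazard, no covariates, a single common censoring time, and made-up parameter values, and it fits a hand-written exponential split-population likelihood with `optim()`. All object names are hypothetical.

```r
# Illustrative only: exponential durations with a cured fraction (assumed values).
set.seed(42)
n           <- 5000
true_hazard <- 0.2   # hazard among at-risk units (assumption)
cure_frac   <- 0.5   # share of units that can never fail (assumption)
censor_time <- 20    # common censoring time (assumption)

cured  <- runif(n) < cure_frac
latent <- ifelse(cured, Inf, rexp(n, rate = true_hazard))
time   <- pmin(latent, censor_time)   # observed duration
failed <- latent <= censor_time       # TRUE if failure observed, FALSE if censored

# Standard exponential MLE that ignores the cured fraction: events / total exposure
naive_hazard <- sum(failed) / sum(time)

# Exponential split-population negative log-likelihood:
#   failure:  p_risk * lambda * exp(-lambda * t)
#   censored: (1 - p_risk) + p_risk * exp(-lambda * t)
negloglik <- function(par) {
  p_risk  <- plogis(par[1])   # probability of being at risk
  lambda  <- exp(par[2])      # hazard for at-risk units
  ll_fail <- log(p_risk) + log(lambda) - lambda * time
  ll_cens <- log((1 - p_risk) + p_risk * exp(-lambda * time))
  -sum(ifelse(failed, ll_fail, ll_cens))
}

fit         <- optim(c(0, log(naive_hazard)), negloglik)
sp_hazard   <- exp(fit$par[2])          # split-population hazard estimate
sp_cure_est <- 1 - plogis(fit$par[1])   # estimated cured fraction

round(c(true = true_hazard, naive = naive_hazard, split_pop = sp_hazard), 3)
```

In this toy setup the standard estimate averages exposure over cured units, so it should sit well below the assumed hazard, while the split-population estimate should land closer to it. Whether the same pattern holds for the Weibull and log-logistic models with covariates is exactly what a proper simulation study would have to establish.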

It seems either way we are somewhat constrained by the fact that we don't observe cured/at risk status, and we have a non-perfect heuristic of back-coding risk for spells that end in failure.

How much the split-pop model helps correct hazard estimates when there is a cured fraction of the population would then, I think, depend on how well our "at risk" heuristic performs. That in turn should work better with a "fast-failing" phenomenon, for example, where it quickly becomes safe to assume that non-failing cases are in fact cured.
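The "fast-failing" intuition can be made concrete with Bayes' rule. A rough sketch, assuming an exponential survival function and invented values for the cure fraction and hazards (the function name is hypothetical and not part of the package):

```r
# P(cured | still not failed at time t) by Bayes' rule:
#   cure_frac / (cure_frac + (1 - cure_frac) * S(t))
# where S(t) = exp(-hazard * t) is the survival function of at-risk units
# (exponential survival is an assumption made for illustration).
p_cured_given_censored <- function(cure_frac, hazard, t) {
  surv <- exp(-hazard * t)
  cure_frac / (cure_frac + (1 - cure_frac) * surv)
}

# Same cure fraction and follow-up time, different speed of failure
fast <- p_cured_given_censored(cure_frac = 0.5, hazard = 0.2,  t = 20)
slow <- p_cured_given_censored(cure_frac = 0.5, hazard = 0.02, t = 20)
c(fast_failing = fast, slow_failing = slow)
```

With a high hazard, almost every at-risk unit has failed by time t, so a surviving case is very likely cured. With a low hazard, many at-risk units are still surviving and the censored group remains a mix of cured and at-risk cases.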

Another thought, we can look at what the model thinks is the fraction of cured/at risk cases. Maybe include some code for that?

Lastly, it's maybe also worth pointing out that the canonical applications all had a clear theoretical case for assuming that there is a cured fraction: not all people smoke during their lifetime, some cancer patients really are cured, some prisoners really do not relapse, etc.

Thanks for all of this, Andy. Your intuition makes a lot of sense. I will discuss this in the reviewer memo and add a bit of text to the paper. Would you mind changing the package so that the model object includes an estimate of the proportion of cured observations, and so that this is reported when using summary on the model object? I think this would be a good addition as it seems for many applications (esp. in the medical field) users are concerned primarily with this estimate rather than coefficient estimates from each equation. It also has implications for model performance, per your comments.

How about something like this in the memo, with attendant changes to the manuscript: We appreciate the suggestion to add discussion of model performance under different cure rates. Unfortunately we were unable to find existing studies that speak to this question for our particular model, and we believe that conducting such simulations would warrant a separate paper. Our intuition about model performance is as follows. When no units are cured, the model may underestimate the hazard relative to a standard model, since it treats some censored cases as not at risk. When all cases are cured, the sample would contain only censored observations, i.e. no failures and thus no variation in one of the response variables the model requires. The more interesting case is where some but not all units are cured. In this case it seems likely that a standard parametric duration model would underestimate the hazard by a larger amount as the proportion of cured observations increases. As such, we think the model is most appropriate in cases where there is a strong theoretical reason to suspect some units are cured, as in the canonical applications of these models (not everyone smokes, some cancer patients are cured, some convicts do not relapse, etc.), and we have added some text along these lines to the paper on p.\ X. Ultimately we cannot observe the proportion of units that are cured, and the usefulness of the model is its ability to draw a probabilistic inference about this unobservable process. The model {\it can} provide an estimate of the cure probability for each observation and thus for the proportion of the sample that is cured, and we have modified the package so that the model object includes an estimate of the proportion of cured observations. This estimate should give users an indication of how appropriate the model is for their application.

Everyone else okay with this? @wardlab @s7minhas @nilswmetternich

I attached the one paper I found. It's rather old. If anyone manages to hunt down anything else please pass it along. Survivorship Analysis (Hill).pdf https://github.com/dhill138/spduration-paper/files/1395901/Survivorship.Analysis.Hill.pdf

Yup, you guys are obviously more familiar with these models than me, so I don’t have anything else to add but to note that the reviewer memo looks good.


From Danny:

Instead of changing the code to estimate a cure rate and adding this to the model object, I thought I would just add a few lines showing how to estimate the cure rate with the predict function, since it will calculate cure probabilities. For that, would you use the conditional or unconditional cure probabilities? And would you calculate the mean predicted probability, or convert the probabilities to predicted cure outcomes and take the mean, like this:
cure_probs <- predict(weib_model, type = "unconditional cure")
cure_rate <- mean(as.numeric(cure_probs > 0.5))
cure_rate
I would think the latter, but I'm not sure.

I think showing example code in the paper is preferable to adding something to the package, because defining the cure rate is a bit more complicated when you have time-varying data and potentially multiple spells per subject.

I would take the mean of the conditional cure probability without dichotomizing, e.g. picking up with the example in the paper:

bscoup$ucure_prob <- predict(weib_model, type = "unconditional cure", newdata = bscoup, na.action = "na.exclude")
bscoup$ccure_prob <- predict(weib_model, type = "conditional cure", newdata = bscoup, na.action = "na.exclude")
mean(bscoup$ccure_prob, na.rm = TRUE)

Not sure I could defend why to use the cure rate conditional on survival time rather than the cure rate conditional only on covariates. FWIW, I think in practice the cure rate estimates tend to cluster towards 0 and 1, and the unconditional and conditional rates are highly correlated, so which to use and whether to dichotomize before averaging doesn't make a huge difference, at least with the example we have going.
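To see when the choice between averaging and dichotomizing matters, here is a small sketch on simulated probabilities. These are not predictions from the coup model: the two vectors are invented, one clustered near 0 and 1 and one concentrated in the middle, and `summarise_cure()` is a hypothetical helper, not a package function.

```r
# Two ways of collapsing cure probabilities into one cure rate:
# the mean probability, and the share of cases above a cutoff.
summarise_cure <- function(p, cutoff = 0.5) {
  c(mean_prob          = mean(p, na.rm = TRUE),
    share_above_cutoff = mean(p > cutoff, na.rm = TRUE))
}

set.seed(1)
n <- 10000
# Invented probabilities: wide spread on the logit scale piles mass near 0 and 1
clustered <- plogis(rnorm(n, mean = 1, sd = 4))
# Invented probabilities: narrow spread keeps most values in the interior
diffuse   <- plogis(rnorm(n, mean = 1, sd = 0.5))

res_clustered <- summarise_cure(clustered)
res_diffuse   <- summarise_cure(diffuse)

# Absolute gap between the two summaries in each scenario
gap_clustered <- unname(abs(diff(res_clustered)))
gap_diffuse   <- unname(abs(diff(res_diffuse)))
round(rbind(clustered = res_clustered, diffuse = res_diffuse), 3)
```

When the probabilities pile up near 0 and 1 the two summaries should nearly coincide, which matches what we see in the example. When they sit in the interior, dichotomizing can move the estimate a lot. On the paper's example the same helper could be applied as `summarise_cure(bscoup$ccure_prob)`.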

E.g.:

hist(bscoup$ccure_prob)

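[image: screen shot 2017-10-25 at 10 25 52] https://user-images.githubusercontent.com/1353756/31985804-f0844598-b96e-11e7-9d0e-68f81ed369ff.png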

versus:

hist(bscoup$ucure_prob)

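[image: screen shot 2017-10-25 at 10 26 30] https://user-images.githubusercontent.com/1353756/31985820-ff7a4e9e-b96e-11e7-82bf-bfd4e362a87b.png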

And about cure rates being harder to define with time-varying data, here are some examples of what the conditional cure rate estimates look like for a sample of countries:

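library(ggplot2)

# Step plot of conditional cure probabilities over time for 10 randomly sampled countries
ggplot(bscoup[bscoup$countryid %in% sample(unique(bscoup$countryid), 10), ], 
       aes(x = year, y = ccure_prob, group = countryid)) + 
  geom_step(aes(colour = factor(countryid))) + 
  scale_colour_discrete(guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  theme_bw()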

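[image: screen shot 2017-10-25 at 10 23 57] https://user-images.githubusercontent.com/1353756/31985887-33076d6e-b96f-11e7-8148-5060e7980651.png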

If it changes over time for a country, from 0 to something near 1 by the end, is that country cured or not cured? And what about countries with multiple spells? It all gets complicated, and maybe the decisions for dealing with these questions differ by application...
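To show how much the answer can depend on these decisions, here is a toy sketch with an invented, unbalanced panel. The column names follow the example above, but the values are made up and none of the three summaries is being proposed as the right one.

```r
# Invented panel: country 1 is observed for four years, country 2 for two
panel <- data.frame(
  countryid  = c(1, 1, 1, 1, 2, 2),
  year       = c(2000, 2001, 2002, 2003, 2000, 2001),
  ccure_prob = c(0.1, 0.3, 0.5, 0.9, 0.8, 0.8)
)
panel <- panel[order(panel$countryid, panel$year), ]

# (a) Mean over all country-years: long-observed countries get more weight
row_mean <- mean(panel$ccure_prob)

# (b) Mean of per-country means: each country gets equal weight
unit_mean_avg <- mean(tapply(panel$ccure_prob, panel$countryid, mean))

# (c) Mean of each country's last observed value: uses only the end of the record
last_obs_avg <- mean(tapply(panel$ccure_prob, panel$countryid,
                            function(p) p[length(p)]))

c(row_mean = row_mean, unit_mean_avg = unit_mean_avg, last_obs_avg = last_obs_avg)
```

The three numbers differ even in this tiny example, and multiple spells per country would add a further choice about whether to summarize by spell or by country.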

Thanks for your reply, Andy. Your point about cure rates with over time data and multiple failures is well taken. I pasted the code for the conditional cure rate into the paper and added some text.

