ddalthorp / GenEst

R package development for a generalized mortality estimator

integrate new cp functionality #72

Closed juniperlsimonis closed 6 years ago

juniperlsimonis commented 6 years ago

- bring code in
- standardize formatting
- check and fix downstream issues
- edge case testing

juniperlsimonis commented 6 years ago

holding on integrating the code fully until I have a code formatting guide from @ddalthorp

juniperlsimonis commented 6 years ago

@ddalthorp you have a new branch with 'cp' in the title. I'm presuming I should integrate that rather than what's in @prabie's branch? Also, let me know if it's ready to be integrated or if I should hold off.

ddalthorp commented 6 years ago

I wrote the cpm and rcp functions that are currently (2018.03.07) in my cp branch before Paul's revision to the cp model. The purpose of the functions is to provide convenient command-line parallels to pkm and rpk, but they are (obviously) rough drafts...not coordinated with Paul's multi-scale options, minimal error-checking, no help files. After reconciling with Paul's and adding skeleton help sections they should be ready for integration.

juniperlsimonis commented 6 years ago

roger that, thanks for the clarification! i'll get on that shortly.

ddalthorp commented 6 years ago

To come up with good starting values for fitting the multi-scale CP function, would it make sense to use survival::survreg estimates based on the constant scale model? This would ease some concerns I have about black-boxing multi-dimensional optimizations. As a by-product, it gives the standard uni-scale fit as an alternative that we could compare to the multi-scale model. (Although, I'm not sure something like AIC would give a meaningful comparison because the question of interest will not be "Which cp model fits better?" but "Does the multi-scale model give more reliable estimates of g and M for the splits that the user is interested in? Does the benefit outweigh the cost of doubling the number of parameters?" AIC would give a first pass at an answer to those questions...)

The solution would be something like:

mod <- survival::survreg(<model with one scale parameter>)
betaInit <- c(mod$coef, rep(mod$scale, length(mod$coef)))

Code below:

## New --->
  # build interval-censored survival times from the left and right bound columns
  t1 <- data[ , left]
  t2 <- data[ , right]
  event <- rep(3, length(t1))               # 3 = interval censored
  event[is.na(t2) | is.infinite(t2)] <- 0   # 0 = right censored (removal never observed)
  event[round(t1, 3) == round(t2, 3)] <- 1  # 1 = exact removal time (t1 == t2)
  t1 <- pmax(t1, 0.0001)                    # survreg requires strictly positive times
  cpt <- survival::Surv(time = t1, time2 = t2, event = event, type = "interval")
  # fit the common-scale model; its estimates seed the multi-scale optimization
  imod <- survival::survreg(
    reformulate(as.character(formulaRHS)[-1], response = 'cpt'),
    data = data,
    dist = dist
  )
  betaInit <- c(imod$coef, rep(imod$scale, length(imod$coef)))
## <--- New

## Old --->
#  if (any(is.infinite(t2))){
#    t2[which(is.infinite(t2))] <- NA
#  }
#  t12 <- (t1 + pmin(t2, 2 * max(t2, na.rm = TRUE), na.rm = TRUE)) / 2
#
#  betaInit1 <- switch(dist,
#                   "loglogistic" = c(mean(log(t12)), var(log(t12)) / 3.2),
#                   "exponential" = 1 / mean(t12),
#                   "lognormal" = c(sqrt(var(t12)), log(mean(t12))),
#                   "weibull" = weibullStart(t12))
#  nparam <- length(betaInit1)
#  betaInit <- matrix(0, nrow = ncol(cellMM), ncol = nparam, byrow = TRUE)
#  betaInit[1, ] <- betaInit1
#  betaInit <- as.vector(betaInit)
## <--- Old
prabie commented 6 years ago

I think that's a fine idea. I'm not sure how the current code black-boxes multi-dimensional optimizations, but I don't see a big disadvantage to using survreg to generate starting values. I think it's a nice idea to ask whether a model with a common scale parameter is a more parsimonious fit.

ddalthorp commented 6 years ago

By "black-boxing", I just meant that when we package optimizations into a GUI, the user loses will lose control over options like method, par, parscale, etc.

The same "black-boxing" would apply to the pk model...and especially would apply to any MCMC we might do in future because MCMC is often harder to tune than functions like optim or nlminb. That's one of Lisa's concerns about trying to pull together a Bayes way to fit the pk model--- it may solve some problems (e.g., the goofy bimodal k possibilities; the "special cases" of empty cells, y = n, y = 0, single search, etc.), but it introduces sticky new problems too (more importance on picking robust, one-size-fits-all tuning parameters, more statistical issues to resolve in the modeling, more intricate and extensive coding).

juniperlsimonis commented 6 years ago

re: constant scale model, that will be included as the intercept-only scale model, right? as for "black box" issues...we can totally allow the user to pass arguments down to lower-level functions, either specifically (using the argument name) or generally (by including the ... argument)
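
for illustration, a minimal sketch of that pass-through idea (fitCP is a hypothetical wrapper name, not a package function, and it assumes the cpLogLik negative log-likelihood sketched later in this thread). anything supplied in ... gets forwarded to optim, so command-line users keep control of method, control$parscale, etc., while the GUI can simply rely on the defaults:

fitCP <- function(betaInit, t1, t2, dataMM, dist, ...){
  # forward any extra arguments (method, control, hessian, ...) straight to optim
  optim(
    par = betaInit,
    fn = cpLogLik,
    t1 = t1, t2 = t2, dataMM = dataMM, dist = dist,
    ...
  )
}

# a command-line user could then tune the optimizer directly, e.g.:
# fitCP(betaInit, t1, t2, dataMM, "weibull",
#       method = "BFGS", control = list(maxit = 1000))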

ddalthorp commented 6 years ago

Having a ... arg at the end is fine, but I don't think we need to make those args available to GUI users.

Sorry for my loose language regarding the "constant scale" model...I was referring to the model with covariates and cells but assuming that all cells share a common scale. This is fit naturally and easily with survreg, e.g., survreg(cpt ~ vis * sea, data, dist = 'weibull')

juniperlsimonis commented 6 years ago

roger that

ah! so what we want then is actually like what we have in pkm, right, where there's a separate equation for p and for k. here we'd want a separate equation for location and scale (unless we're using the exponential, in which case the scale is fixed). that way, we could run cpm(formula_l = l ~ vis * sea, formula_s = s ~ 1, data, "weibull") and that should give us the same estimates as survreg(cpt ~ vis * sea, data, dist = 'weibull'), right?
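
for a quick sanity check of that equivalence, something like the following could work (this assumes the cpt survival object and the vis and sea columns from the earlier sketches; the cpm call uses the interface proposed above, which is still hypothetical at this point):

## common-scale fit via survreg (cpt built as in the starting-values code above)
smod <- survival::survreg(cpt ~ vis * sea, data = data, dist = "weibull")
coef(smod)    # location coefficients
smod$scale    # single scale shared by all cells

## hypothetical cpm call with an intercept-only scale formula; if the two
## parameterizations line up, the location estimates should match coef(smod)
## and the fitted scale should equal smod$scale in every cell
# cmod <- cpm(formula_l = l ~ vis * sea, formula_s = s ~ 1, data, "weibull")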

ddalthorp commented 6 years ago

The likelihood functions need to be updated to include uncensored data (e.g., scavenging observed on camera). By my understanding, this simply involves substituting the density dsurvreg(t1) for the interval probability psurvreg(t2) - psurvreg(t1) in the likelihoods for the exponential and non-exponential models (e.g., https://en.wikipedia.org/wiki/Survival_analysis).

E.g.:

cpLogLik <- function(beta, t1, t2, dataMM, dist){
  beta <- matrix(beta, ncol = 2)
  Beta <- dataMM %*% beta                   # column 1: location; column 2: scale
  psurv_t1 <- survival::psurvreg(t1, Beta[ , 1], Beta[ , 2], dist)
  psurv_t2 <- survival::psurvreg(t2, Beta[ , 1], Beta[ , 2], dist)
  psurv_t2[which(is.na(psurv_t2))] <- 1     # right-censored: F(Inf) = 1
  lik <- psurv_t2 - psurv_t1
  # lik[lik < .Machine$double.eps] <- .Machine$double.eps
  # what we want is to replace the collapsing integrals where t1 = t2 with f(t2)
  # instead of S(t1) - S(t2) = 0
  exact <- t1 + sqrt(.Machine$double.eps) >= t2
  lik[exact] <- survival::dsurvreg(t2[exact], Beta[exact, 1], Beta[exact, 2], dist)
  # (this assumes error-checking that at least assures that t1 <= t2)

  nll <- -sum(log(lik))
  return(nll)
}

...and similar for the exponential version.
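
For concreteness, a rough sketch of what that exponential analogue might look like (a sketch following the structure of cpLogLik above, not code from the cp branch): the exponential has no free scale parameter, so beta carries only the location coefficients and the scale is fixed at 1.

cpLogLikExp <- function(beta, t1, t2, dataMM){
  Beta <- dataMM %*% beta                   # location only; scale fixed at 1
  psurv_t1 <- survival::psurvreg(t1, Beta[ , 1], 1, "exponential")
  psurv_t2 <- survival::psurvreg(t2, Beta[ , 1], 1, "exponential")
  psurv_t2[which(is.na(psurv_t2))] <- 1     # right-censored: F(Inf) = 1
  lik <- psurv_t2 - psurv_t1
  # uncensored observations (t1 == t2): use the density f(t2) instead of the
  # collapsing interval probability F(t2) - F(t1) = 0
  exact <- t1 + sqrt(.Machine$double.eps) >= t2
  lik[exact] <- survival::dsurvreg(t2[exact], Beta[exact, 1], 1, "exponential")
  -sum(log(lik))
}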

ddalthorp commented 6 years ago

"...separate equation for location and scale"

I think this is an open question. Right now it's programmed so that the shape and scale are subsumed under one model form, like formula = cpt ~ vis * season. I kind of like that, BUT an alternative model that we can check automatically is the default survival model (with common scale). Do we need to have the full two-formula option (so 5 x 5 x 4 models altogether in the case of two covariates: five candidate formulas for the location, five for the scale, and four distributions)?

What does everyone think?

prabie commented 6 years ago

I think it would be nice to have that, to be parallel to the pk model. I don't think it's a priority for the beta version.

The ability to handle uncensored data seamlessly is a higher priority in my mind, and I'm not sure even that needs to be in the beta version.

juniperlsimonis commented 6 years ago

#77 (cpm) and #78 (rcp) provide the new flexible base functions that allow for any potential combination of covariates on both the location and scale parameters, handle all four distributions pretty seamlessly, and give output on both the ppersist and survreg scales. I'm closing this issue here and opening new ones to address the remaining steps in integrating the new cp functionality.