chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/

Missing or infinite values in "X" when fitting model? #142

Closed AustinFournierKL closed 1 year ago

AustinFournierKL commented 1 year ago

When trying to fit a flexsurvmix model to my data, I'm getting the message:

Error in eigen(inv_hessian, symmetric = TRUE, only.values = TRUE) : 
  infinite or missing values in 'x'

This is a little odd to me, because I can confirm that the wait times contain only finite numbers, with no NAs, negatives, or infinite values. In terms of the other inputs to flexsurvmix, the "event" column does have NAs for censored events, but I understood that this is expected for censored observations; the status column only contains 1s and 0s. There are a lot of censored data points because I'm working with a very new data set: about 9/10ths of the data is censored at this point in time. I'm not sure whether that should prevent model fitting or just make the results more suspect. With this in mind, do you have any idea what this error message could be referring to?

I can't post the data set I'm using, but if it helps the part of my code after data cleaning is as follows.

nursery_model = flexsurvmix(Surv(wtime, status) ~ 1, event = event, dists = c(Bad_Opportunity = "weibull", Signup = "weibull"), data = x3)

Also, in an earlier run the model was fitted and returned results, but with a warning that they might be inaccurate because the Hessian was not positive definite. It switched to the error above without me making any changes I can think of (I was coming back to the script to try to fix that warning when this one popped up).

chjackson commented 1 year ago

The x here doesn't refer to your data but the matrix that eigen is trying to get the eigenvalues of, that is, the inverse of the Hessian at the parameter estimates. If this has infinite values, that suggests your model is not identifiable from the data, in other words, the likelihood is a flat function of one or more of the parameters. Usually in these cases I'd simplify the model until it converges, then build the model back up to identify what the problematic part of the model is.

A non positive-definite Hessian is a similar issue - it usually means the claimed "solution" is not actually the maximum likelihood. In those cases, there may or may not be a meaningful true maximum of the likelihood, and this is often a sign that the model should be simpler. If the reported confidence intervals are extremely large, that is also a red flag.
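
As a rough base R illustration of the two points above (a sketch only, not flexsurv's internal code): a toy negative log-likelihood that is flat in one parameter produces a Hessian with a zero eigenvalue, so it cannot be inverted to a usable covariance matrix, and passing a matrix with a non-finite entry to eigen reproduces the error message from the original post. The names flat_nll and inv_hessian_demo and the matrix values are made up for the example.

```r
# Toy negative log-likelihood that ignores its second parameter,
# i.e. the likelihood is flat in p[2] (hypothetical example).
flat_nll <- function(p) (p[1] - 1)^2

# Numerical Hessian at the minimum, then its eigenvalues.
hess <- optimHess(c(1, 0), flat_nll)
ev <- eigen(hess, symmetric = TRUE, only.values = TRUE)$values
ev  # one eigenvalue is (numerically) zero: the Hessian is singular

# A made-up "inverse Hessian" containing a non-finite entry.
inv_hessian_demo <- matrix(c(0.5, 0, 0, Inf), nrow = 2)

# eigen() rejects non-finite input with the message seen in the issue.
msg <- tryCatch(
  eigen(inv_hessian_demo, symmetric = TRUE, only.values = TRUE),
  error = function(e) conditionMessage(e)
)
msg
```

A quick check such as all(is.finite(m)) on the matrix, or looking for zero or negative eigenvalues of the Hessian, is one way to spot this situation before interpreting standard errors.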

(The error message here was returned by eigen, a base R function, not one of flexsurv's functions - I'll guard against this happening in a future version and put a proper error message in flexsurv.)

AustinFournierKL commented 1 year ago

Hmm, that's a bit troublesome, but I suppose it is what it is; I do have some rough expedients I can apply to reduce model complexity a bit.

I wouldn't have expected a two-component Weibull mixture to be too complicated on its own, so do you think that it's failing in part because there are so few non-censored observations (~30 of 300)? I'm curious as to whether this is something I'll never be able to do for some fundamental structural reason, or whether I'll be able to revisit this problem with the analysis I was originally planning to use once more data comes in.

chjackson commented 1 year ago

Yes - censoring basically reduces the amount of information, so, depending on how early the censoring is, you might effectively have little more than 15 data points per mixture component (about 30 events split across two components), especially if you don't know what the eventual event would have been in the censored cases.

AustinFournierKL commented 1 year ago

Okay. Incidentally, I tried fitting another mixture model to a much more complete data set and got this:

Error in optim(inits_opt, loglik_flexsurvmix, hessian = FALSE, method = "BFGS",  : 
  initial value in 'vmmin' is not finite

Is that the same type of issue, or is that something I can solve by setting a manual start for parameters or something?

AustinFournierKL commented 1 year ago

Update: I'm not sure if I should close this issue and open a new one at this point, but since it seemed like the initial values were the problem I tried it and got this far:

cs_model = flexsurvmix(Surv(wtime, status) ~ 1, event = event, dists = c(Bad_Opportunity = "weibull", Signup = "weibull"), data = x1, inits = c(Bad_Opportunity = function(t){ c(1, median(t)) }, Signup = function(t){ c(1, median(t)) }))
Error in inits[[k]][1:nthetal[k]] : 
  object of type 'closure' is not subsettable

Any hints on how I'm messing up the syntax here?

chjackson commented 1 year ago

You're using the syntax for custom distributions - functions defining heuristics to obtain initial values from the data. For built-in distributions, the initial values should just be numbers giving the actual initial values for your model fit (a list of numeric vectors in the case of flexsurvmix).
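
A minimal base R sketch of why the call above fails and what the numeric form looks like. The wtime values are invented for illustration, and the final flexsurvmix call is shown only as a comment because it depends on the poster's data (x1), which is not available here.

```r
# Hypothetical waiting times standing in for x1$wtime.
wtime <- c(2.1, 3.5, 4.0, 6.2, 9.8)

# What was supplied: c() of functions gives a list of functions.
inits_bad <- c(Bad_Opportunity = function(t) c(1, median(t)),
               Signup = function(t) c(1, median(t)))

# Indexing a function like a vector reproduces the error from the post.
msg <- tryCatch(inits_bad[[1]][1:2],
                error = function(e) conditionMessage(e))
msg

# Numeric form: evaluate the heuristic yourself and pass the numbers,
# one numeric vector (shape, scale) per mixture component.
inits_ok <- list(Bad_Opportunity = c(1, median(wtime)),
                 Signup = c(1, median(wtime)))
inits_ok[[1]][1:2]

# The list would then be passed as, for example:
# cs_model <- flexsurvmix(Surv(wtime, status) ~ 1, event = event,
#   dists = c(Bad_Opportunity = "weibull", Signup = "weibull"),
#   data = x1, inits = inits_ok)
```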

The vmmin error message means the likelihood of the data is zero, or log-likelihood is -Inf, at the initial values. It's not necessarily solvable by changing the initial values for the parameters. It often indicates something extreme in the data somewhere. I would usually diagnose this by setting fixedpars=TRUE in the model fit so that it simply calculates the likelihood once, instead of trying to maximise it. Then repeat that for subsets of the data and/or different initial values. If the returned likelihood is finite, there is no problem with that part of the data / inits.
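
The same diagnostic idea can be sketched in base R with a stand-in Weibull likelihood (this is not flexsurv's own likelihood code, and the data are invented). Here a single zero waiting time is assumed as the "something extreme": evaluating the likelihood once at the initial values gives a non-finite result, optim refuses to start, and evaluating subsets locates the offending observation. In flexsurvmix itself, fixedpars=TRUE plays the role of the single evaluation.

```r
# Hypothetical data: one zero waiting time among ordinary values.
wtime <- c(0, 1.2, 3.4, 5.6, 7.8)

# Stand-in Weibull negative log-likelihood, parameters on the log scale.
negloglik <- function(par, t) {
  -sum(dweibull(t, shape = exp(par[1]), scale = exp(par[2]), log = TRUE))
}

# Initial values: shape 2, scale equal to the median time.
inits <- c(log(2), log(median(wtime)))

# Evaluate once at the initial values, without optimising.
negloglik(inits, wtime)  # Inf

# optim() with BFGS stops immediately on a non-finite starting value.
msg <- tryCatch(
  optim(inits, negloglik, t = wtime, method = "BFGS"),
  error = function(e) conditionMessage(e)
)
msg

# Per-observation contributions show which rows cause the problem.
contrib <- vapply(wtime, function(ti) negloglik(inits, ti), numeric(1))
bad_rows <- which(!is.finite(contrib))
bad_rows

# With those rows excluded, the starting likelihood is finite again.
negloglik(inits, wtime[-bad_rows])
```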

AustinFournierKL commented 1 year ago

Understood, thank you.