chjackson / msm

The msm R package for continuous-time multi-state modelling of panel data
https://chjackson.github.io/msm/
57 stars 17 forks source link

A question about very high total times using msm.totlos #89

Closed taledv closed 9 months ago

taledv commented 10 months ago

Hello Dr. Jackson,

First thank you for the package, this is great.

I have a question regarding some phenomena I encounter when using the package.

I have a simple 3-state data, with recovery (health-ill-dead), using only age as a covariate. After fitting the data, I see for example for age=60, a sojourn time of around 26 years for State=1 (Healthy) and around 3.5 years for State=2 (ill), which are a bit high, but reasonable. When sojourn time is computed for state=1, the time is 1/(q(1,2)+q(1,3)), so the sum of these two q's is around 1/26, but q(1,3) is very very small (0.0029), which I think causes the very unreasonable times when using msm.totlos. (see below the full Q matrix)

When I use the msm.totlos. I get 126 years for State=1, which is quite unreasonable. (a 60 years old should spend healthy more than 100 years) So I suspect its something with death times (or particularly with moving from healthy to dead 1->3. One thing I should mention is that I have exact death times for only 54% of the population.

I have a few ideas and would like to hear you opinion to try and solve my "bug":

  1. Move from 1 to 3 is very rate -> causes q(1,3) to be very small (but this is always the case, people often turn ill before dying, so I don't think its a valid reason.
  2. I'm mis-using the package (censor/death times/ some other parameters) (example of the usage is below) -> for example maybe my time column is wrong, I start at zero for every "id" at his/her first sample, which seems right.
  3. My data is "bad" -> maybe for some reason people always get ill before dying, so a possible solution is "raise the bar" for the ill state, to increase 1->3 transitions, but this also doesn't seem fully right.
  4. msm can't work with only ~54% dead, it needs all of the population to be dead, because for all who didn't die it takes the death time to infinity or something -> seems unreasonable as of my understanding the algorithm just takes time between states, it doesn't "invent" times.
  5. When I increase age by one year, the total times decreases by ~15 years for each year, which is also odd, it should decrease by around 1 year., even maybe less, but probably part of the same problem (see below total times for ages 60,61 and 62)

In conclusion, I see that q(1,3) is very very small which probably causes total times for state=1 to be very very large, and I can't seem to understand why.

Im sorry for the long description, and would really appreciate your help. Thank you very much, Tal

The call to the package: q <- 0.001 Q <- rbind(c(0,q,q), c(q,0,q),c(0,0,0)) model1 <- msm(state ~ time, subject=id, data = df_hrs, qmatrix=Q, gen.inits=TRUE, covariates=~age, center=TRUE, control=list(reltol=1e-8, maxit=10000, fnscale=100000), method="BFGS", deathexact = 3)

Q matrix for age=60: q(1,1) = -0.038532 (-0.039586,-0.03751)
q(1,2) = 0.035627 ( 0.034593, 0.03669) q(1,3) = 0.002905 ( 0.002654, 0.00318) q(2,1) = 0.247676 ( 0.239192, 0.25646) q(2,2) = -0.288173 (-0.297293,-0.27933) q(2,3) = 0.040497 ( 0.038113, 0.04303) q(3,3) = q(3,2) = q(3,1) = 0

Sojourn times for age=60: State 1 25.952266 0.35739870 25.26115 26.662294 State 2 3.470139 0.05516734 3.36368 3.579967

Total times for age=60: State 1 126.388253754517 State 2 15.625380531923 State 3 Inf

Total times for age=61: State 1 111.109880320523 State 2 14.7650147181606 State 3 Inf

Total times for age=62: State 1 97.7306018941964 State 2 13.9446723440259 State 3 Inf

chjackson commented 10 months ago

How are you invoking totlos.msm()? If you're doing e.g. totlos.msm(model1, covariates=list(age=60)) I would expect the bias you are seeing, because this holds age constant at 60, so it assumes the predicted individual has the risk of a 60 year old for the rest of their life.

If there is a time-dependent covariate, as age in your example, it does not automatically know when the values of that covariate change during a prediction. You have to tell it explicitly, e.g. that year of age changes once every year, using the piecewise.times and piecewise.covariates arguments. See the course notes for more examples with time dependent covariates.

chjackson commented 10 months ago

The model also assumes a linear effect of age, which is estimated from the data. This won't necessarily fit the data well. Even if it does fit, you would only be able to make predictions within the range of ages in your data.

taledv commented 10 months ago

Thanks for the quick answer! I'm indeed using totlos.msm(model, covariates=list(age=60)) to get the results I mentioned. Just to be sure, my call to msm(): q <- 0.001 Q <- rbind(c(0,q,q), c(q,0,q),c(0,0,0)) model1 <- msm(state ~ time, subject=id, data = df_hrs, qmatrix=Q, gen.inits=TRUE, covariates=~age, center=TRUE, control=list(reltol=1e-8, maxit=10000, fnscale=100000), method="BFGS", deathexact = 3)

as I wrote is ok? only my call to "totlos.msm" is not ok since age is time dependent? how would you change it? and why for sojourn it is ok to just call sojourn.msm(model, covariates=list(age=60))? BTW, it is ok that not all of my population end up with state=3, right? I would really like to try it even now to see if it works, if its too complicated to know what the inputs to totlos.msm should be I'll just read you course notes and figure it out, thank you for those. As I understand it, every function that needs t in its input, like pmatrix and totlos needs this added info about time-dependent covariate. But what about time-dependent covariate that change not-deterministically over time, like "has cancer", does the model have answer to those also? sorry for the added questions and clarifications. Thanks again for the quick reply!

chjackson commented 10 months ago

I can't see anything immediately wrong with how you are using msm(), but I'm not familiar with your application so don't take that as a guarantee. It's OK that not everyone reaches state 3.

Time-dependent covariates aren't supported by sojourn.msm, as documented in its help page. The call you have given will fix age at 60 throughout the sojourn.

The course notes give an example that you should be able to adapt easily.

taledv commented 10 months ago

Tried it, using ages from 60 to 120, and times from 1 to 60, and im getting now in total times: State 1 13.4970282836508 State 2 2.90096488255332 State 3 Inf Which is less than what im getting at sojourn times, but as I'm seeing your reply and as I've realized even beforehand I understand than any estimation needs to know how the covariates change over time, and when they change "unexpectedly", not in the simple way of age that I can just add one year of time for every increase of 1 of age, it needs the "joint model". So sojourn doesn't take the arguments as totlost, and can't be computed for time-dependent covariates? BTW, 16 years (13.49+2.9) that im getting for a 60 year old is disappointing :) Lastly, I think it may confuse some to see in the help page of sojourn that you provided the list (age = 60, sex = 1) input to the covariates, because as you said, it gives wrong result because it keeps the person at age 60, even though of course as he/she ages the transitions probabilities change and effect the time.

Thanks again!

taledv commented 10 months ago

Hi again Dr. @chjackson ,

First, I really appreciated your help so far, I don't take it for granted, it really helped me a lot!

Hopefully this is my last issue with totlos.msm, but I have one more problem with it.

As mentioned before, the total times (State1+State2) are a bit short, this can be due few reasons.

When taking only people who died (around 86% of the data), and taking their age at death and raising it by 20 years (also the time vector at death states), this should enlarge the total times of State1+State2 by around (or even exactly) 20 years!. Because if every person lives another 20 years so for every age the total expectancy time should be raised by 20 years. BUT it raises only by around 10 years, as if it just ignores the time between the last active state (State 1 or 2) and death (State=3).

I assume it has something to do with how I define the model or use the totlos.msm, this is how I do it:

Mode fitting: q <- 0.001 Q <- rbind(c(0,q,q), c(q,0,q),c(0,0,0)) model <- msm(state ~ time, subject=id, data = df_hrs, qmatrix=Q, gen.inits=TRUE, covariates=~age, center=TRUE, control=list(reltol=1e-8, maxit=10000, fnscale=100000), method="BFGS", deathexact = 3)

Calling totlos: age_list <- lapply(seq(65, 160, by=0.1), function(age) list(age = age)) time_vector <- seq(delta_t, length.out = length(age_list) - 1, by = 0.1) output_list = totlos.msm(model, piecewise.covariates = age_list, piecewise.times=time_vector, start=1)

Thank you again, your help would really be appreciated, Tal

chjackson commented 10 months ago

In the call to totlos.msm, check delta_t is defined appropriately - I think this should be 160 - 65 - 0.1? But there's a lot of details I can't see here, e.g.

I won't have time to look in depth, sorry. I can only suggest that you break the analysis down into smaller steps, and make sure that you know what each step is doing mathematically, hence whether the results make sense.

taledv commented 10 months ago

Thank you for the quick response, my time vector where I input state ~ time, starts at 0 for every first session of a person. So I thought that also the time vector should start from zero. I think this is correct, but I may be wrong. I created the time vector from the age vector, just subtructed the first age of each person per person, so it will always start from 0. My manipulation of the data that contains only persons that died eventually (86% of the data) is very simple: data.loc[data['state']=3, 'age'] += 20, and also for the time column the same, data.loc[data['state']=3, 'time']+=20, so adding 20 years to the death age. When I'm doing a very simple Linear regresion, where my X is "age at first session" and y is "age at death", it does add those ~20 years to age 65 from around ~13 to around ~33 years. But when using the multi-state model as described it adds "only" around 10 years, to 23 years. My data is quite simple, I have a few samples per person, with states 1, 2 and eventually last state 3 is death. Seems to me that if I add X years of "living" and it doesn't matter if in state 1 or state 2, it should add around X years to the total time.

taledv commented 10 months ago

Hi Dr. @chjackson ,

I hope this is my last question, and it is a real quick one. As said before I have 3 states, where state=3 is death, and I'm using deathexact = 3 in the model fitting, as I have exact death dates.

My question is, how to impute the time-dependent covariates where State=3, because of course I don't know the cancer or arthritis condition or anything regarding the person at time of death, the only thing I know is when the person died. Does the model even take those covariate values into account in rows where State=3?

I have two options as I see:

  1. Take the covariate values from previous session, they will resemble the values at death, best estimation.
  2. Put just NaNs in the time-dependent covariates at State=3 (don't know if the model will "love" this) If the model ignores the covariate values at death, so it really doesn't matter what option I choose. Thank you, Tal
chjackson commented 10 months ago

The covariate value specified at the last observation for a subject is not used by msm, so it can be anything.

msm assumes that the covariate value at any observation is constant until the next observation. The likelihood for (state at observation j | state at observation j-1) is calculated using the covariate value at observation j-1. So the value at the final observation is never used.

taledv commented 10 months ago

This is great, this is what I thought!!, makes total sense, thank you!, so even if the person didn't die, it doesn't take the covariates at the last observation for the person, great. I guess it DOES take the "time" from last observation, because it is important to know when the person died, but it doesn't take the covariates, which make total sense. Thank you very much!

taledv commented 9 months ago

Hello again Dr. @chjackson , I hope everything is fine, I have a question regarding the formula in page 25 in: https://cran.r-project.org/web/packages/msm/vignettes/msm-manual.pdf. (also detailed in page 115 in https://cran.r-project.org/web/packages/msm/msm.pdf) It is said there: "The expected total length of stay in state j between times t1 and t2, from the point of view of an individual in state i at time 0, is defined by the integral from t1 to t2 of the i, j entry of the transition probability matrix P(t)" So if i replace i with 1, and j with 2, and t1=0, t2=inf, it is said that the total time at state 2, starting from state=1 at t=0, only depends on the transition probability P1,2(t), which confuses me, because you can get to state 2 from other states too, so it must depend also on, for example, P(3,2)(t) if this transition is possible, or more generally any P(i,2)(t), not just P(1,2)(t). Can you explain this?, if this is the actual computation, as I understand now, this is not the total time spend at state=2 (in my example), but only part of it, the part the starts always from state=1 and moves to state=2.

chjackson commented 9 months ago

$P_{1,2}(t)$ is the probability that someone in state 1 at the current time (0 say) will be in state 2 at time $t$ later. It doesn't say anything about what states they were in between time 0 and time $t$. Remember these are continuous-time models where transitions can happen at any time, not just at discrete time units.

taledv commented 9 months ago

Thank you, I understand. I guess I find the integral odd that it gives the total time in state 2 from all states, not just from state 1, but it is dependent only on p12, but I'll live with it :) BTW, I tried summing p12, to get the total time spent at state=2, and got lower number, but maybe my estimation of the integral is not that accurate (took deltaT=0.1) Thanks!