CecileProust-Lima / lcmm

R package lcmm
https://CecileProust-Lima.github.io/lcmm/
54 stars 13 forks source link

complete data? covariates? transform Y axis to natural scale? #225

Closed stzhf86 closed 9 months ago

stzhf86 commented 11 months ago

Dear all,

I am working with the R-package ‘lcmm’, and I have several quick questions. Could you please help me?

Suppose the variable of repeated measurements is Y, and X is the follow-up time. X is organized to be 1 to 6 months. Each patient could have one measurement for each month.

However, the real data is not complete. Some patients do not be hospitalized again during the 6 months. So some patients could only have 3 measurements, like at 1th, 3th, 5th months. That is, the dataset is incomplete and has missing values in Y.

I built one model like: m2<-hlme(Y~ poly(X, degree=2, raw=TRUE), mixture=~ poly(X, degree=2, raw=TRUE), subject = "id", random = ~ -1, ng = 2, data = tmp)

So, my questions are:

  1. Functions “hlme” or “lcmm” do not need complete data for the outcome variable Y, right? The incomplete dataset could directly be inputted into the function?
  2. In this real dataset, missing values in Y has reasons. There is one variable (i.e., Z) that could be related to this. So I am thinking, do I need to include Z into this model? And Where I should include it? m2<-hlme(Y~ poly(X, degree=2, raw=TRUE) + Z, mixture=~ poly(X, degree=2, raw=TRUE), subject = "id", random = ~ -1, ng = 2, data = tmp) Is that correct?

About the plot, I have the other five questions: Suppose I built a model "m5spl" with a link function with 5 equidistant knots.

  1. I used plot(m5spl) to plot the predictions and observations, however, the scale is link function-transformed. Is there a way to show the plot with Y-axis in its natural scale?

  2. To solve the 1st problem above, I tried to use the function predictY() to get the predictions, and manually calculate the weighted mean of observations in one plot. The aim is to show the matching of predictions and observations. Is this correct?

  3. when manually calculate the weighted mean of observations, Suppose one time point ‘T1’ and one specific class 'Class1', this weighted mean is calcuated on all the patients (not just those finally allocated to Class 1), and use their posterior probability to allocate to Class1. Is this correct?

  4. when I get predictions from predictY, I use plot() to get trajectory. Is that the same I plot predictions using ggplot: geom_smooth(method='loess')?

  5. I checked the predictions trajectories (as mentioned above), and I used the geom_smooth(method='loess') to show the trajectory of observations, However, these two plots could be quite different. I am wondering, which one usually should be reported in the paper? Is it appropriate to show the second one (the real values)?

Sorry for so many questions.

Thank you in advance.

Looking forward to your help.

Best Wishes, Fan

CecileProust-Lima commented 11 months ago

Hi Fan, regarding the modeling questions:

  1. Functions “hlme” or “lcmm” do not need complete data for the outcome variable Y. It takes only the observations where they are, whatever the times, whatever the number of measurement times per subject. The incomplete dataset could directly be inputted into the function: yes, the program will select the observations to be analyzed and report the deleted observations in the summary.
  2. The model works under the missing At Random principle. If you think one variable is of importance, then you should adjust on it. Be careful though to consider interactions with time if you assumption is that subjects with different levels of Z may have different shapes of change over time.

Regarding the plot questions:

  1. the function to use for the natural scale is predictY and then plot on this predictY object. Please refer to the vignettes for this.
  2. PredictY will only provide the marginal predictions, not the individual predictions. For the individual predictions, you need to use predictYCond that will just back transform a prediction you have in the transformed scale, for instance in fitted. Except from this, you can then indeed weight as you wish using the posterior probabilities
  3. In the weighted predictions, all the subjects contribute with the level of their posterior probabilities. Please refer to papers for the formulas, for instance: https://doi.org/10.1177/0962280212445839
  4. you should not use geom_smooth(method='loess')with longitudinal and incomplete data. This method does not handle missing data and correlation. you should rely on the model predictions.

You will find more information in the vignettes and scientific papers referenced in the package. Best Cécile

stzhf86 commented 11 months ago

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

Dear Cécile,

Thanks a lot for removing most of my problems.

I still have one following question: how to claculate the “obs.class1” or “obs.class2” in the output of the function plot(which = ‘fit’)?

Suppose, a simple model:

Fit <- hlme(fixed = X ~ Time, mixture = ~Time, random = ~-1, subject = "ID", ng = 2, nwg = F, data = tmp),

I got the probabilities (Prob1 and Prob2) and final trajectory_classes from the output: Fit$pprob, and organize the dataset in a long format as below.

ID | Time | Trajectory_class | X | Prob1 | Prob2 -- | -- | -- | -- | -- | -- 1 | 1 | 1 | 1 | 0.8 | 0.2 1 | 2 | 1 | 2 | 0.8 | 0.2 1 | 3 | 1 | 3 | 0.8 | 0.2 2 | 1 | 2 | 4 | 0.3 | 0.7 2 | 2 | 2 | 5 | 0.3 | 0.7 2 | 3 | 2 | 6 | 0.3 | 0.7 3 | 1 | 1 | 7 | 0.1 | 0.9 3 | 2 | 1 | 8 | 0.1 | 0.9 3 | 3 | 1 | 9 | 0.1 | 0.9 4 | 1 | 2 | 10 | 0.4 | 0.6 4 | 2 | 2 | 11 | 0.4 | 0.6 4 | 3 | 2 | 12 | 0.4 | 0.6

 

Then, I think one way to manually calculate obs.class1:

In each time point such as Time=1,

obs.class1 = sum(Prob1 * X) / sum(Prob1)

     =(1 * 0.8 + 4 * 0.4 + 7 * 0.1 + 10 * 0.4) / (0.8 + 0.4 + 0.1 + 0.4)      

 

Is this correct?

In my real data, what I calculated by this aforementioned formula, is close to but still different from the obs.class1 that is from predict(which = ‘fit’)?   I have noticed the break.times, and set them to be exactly 1, 2, 3. Because in my real data, the time point is exactly 1, 2, 3.

But the output is still different.

 

Your help is really appreciated.

Thanks again.

 

Best Wishes,

Fan

VivianePhilipps commented 10 months ago

Hi,

yes, the formula is correct. In your numerical application, you give the wrong probability to the second subject (it's 0.3 instead of 0.4).

Viviane

stzhf86 commented 9 months ago

Hi Viviane,

Sorry for the delay response.

I would like to thank you to clear all my questions.

Really appreciate your help.

Best Wishes, Fan