DenisRustand / INLAjoint

Joint modeling multivariate longitudinal and time-to-event outcomes with INLA
17 stars 0 forks source link

How to predict for Two-part joint model for a longitudinal semicontinuous outcome and a terminal event #29

Closed bbb801 closed 5 months ago

bbb801 commented 6 months ago

Dear Sir or Madam,

I have fitted a Two-part joint model for a longitudinal semicontinuous outcome and a terminal event. But I got an error when making a prediction.

Here is my model: M8 <- joint(formLong = list(binary ~ Age + Sex + (1|id), Test_Result_Converted ~ Age + Sex + (1|id),

                        binary ~  Age  + Sex +  (1|id),
                        Test_Result_Converted ~ Age  + Sex +  (1|id),

                        binary ~  Age  + Sex +  (1|id),
                        Test_Result_Converted ~ Age  + Sex +  (1|id)),

        formSurv = inla.surv(survival_days, status) ~ Age + Atrial_Fibrillation + Diabetes +
          Height + Hyperlipidemia + Maligancy + Sex + Weight + stroke_category, dataSurv=dd,
        dataLong = list(w_Alkaline_Phosphatase,wPositive_Alkaline_Phosphatase,w_Sodium, wPositive_Sodium,w_Basophil, wPositive_Basophil), timeVar="lab_survival_days", 
        id = "id", corLong=TRUE, assoc = c("SRE_ind", "SRE_ind","SRE_ind", "SRE_ind","SRE_ind", "SRE_ind"),
        family = c("binomial", "gaussian","binomial", "gaussian","binomial", "gaussian"))

I have tried: dataSurv=dd dataLong = list(w_Alkaline_Phosphatase,wPositive_Alkaline_Phosphatase,w_Sodium, wPositive_Sodium,w_Basophil, wPositive_Basophil) predict(M8, newData=list(dataSurv,dataLong))

But I still have an error.

Could you give some guidance?

DenisRustand commented 6 months ago

Hi, Predictions are not yet implemented for this structure of association ("SRE_ind") but you can use "SRE" which should be the same for your model (since you only include random intercepts in each longitudinal submodel). Let me know if it fixes your error and if not, please provide some details on the error message. Best, Denis Rustand

bbb801 commented 6 months ago

Hi, Predictions are not yet implemented for this structure of association ("SRE_ind") but you can use "SRE" which should be the same for your model (since you only include random intercepts in each longitudinal submodel). Let me know if it fixes your error and if not, please provide some details on the error message. Best, Denis Rustand

Thank you. I can change 'SRE_ind' with 'SRE'. The question is how to write the codes for prediction after the training? My longitudinal data is: dataLong = list(w_Alkaline_Phosphatase,wPositive_Alkaline_Phosphatase,w_Sodium, wPositive_Sodium,w_Basophil, wPositive_Basophil) while survival data is: dataSurv=dd. So using predict function, is it to write down: predict(list(dataLong , dd))? I tried some codings, but fail to run. 1. So how to use predict function for modeling 'Model 14: Two-part joint model for some longitudinal semicontinuous outcomes and a terminal event'? 2. What is the predicted outcome from predict function? Is it the survival risk for the survival equation?

DenisRustand commented 5 months ago

Sorry for the delay, I'm out of office at the moment. I was preparing a tutorial code but I noticed some issues in the predict function that I have to fix, I'll do it ASAP and reply here once it is ready.

For your second question: we predict the linear predictor for each submodel (longitudinal or survival) and we can also internally transform these predictions to a desired scale (i.e., probability for binary part, survival probability for risk models, ...), I'll illustrate this for you.

bbb801 commented 5 months ago

Thank you very much. I am looking foward to your tutorial code.


From: DenisRustand @.> Sent: Thursday, 2 May 2024 15:42 To: DenisRustand/INLAjoint @.> Cc: LIU Jundong @.>; Author @.> Subject: [Ext] Re: [DenisRustand/INLAjoint] How to predict for Two-part joint model for a longitudinal semicontinuous outcome and a terminal event (Issue #29)

CAUTION: External email. Do not reply, click on links or open attachments unless you recognize the sender and know the content is safe.

Sorry for the delay, I'm out of office at the moment. I was preparing a tutorial code but I noticed some issues in the predict function that I have to fix, I'll do it ASAP and reply here once it is ready.

For your second question: we predict the linear predictor for each submodel (longitudinal or survival) and we can also internally transform these predictions to a desired scale (i.e., probability for binary part, survival probability for risk models, ...), I'll illustrate this for you.

— Reply to this email directly, view it on GitHubhttps://github.com/DenisRustand/INLAjoint/issues/29#issuecomment-2089817311, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AMPCDZJKTV7UPWZFIRBAEL3ZAHU6PAVCNFSM6AAAAABGNQKVHSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBZHAYTOMZRGE. You are receiving this because you authored the thread.Message ID: @.***>

bbb801 commented 5 months ago

By the way, my data has missing values, and the model has successfully fitted with missing values. May I know how the missing values are handled in model?

DenisRustand commented 5 months ago

Hi,

Please find attached a code that illustrate how to use the predict function for the model you want. Make sure you download the latest INLAjoint version from GitHub.

I included 2 semi-continuous outcomes and 1 survival, so you can see how it works, in particular the argument "newData" for the longitudinal part should only receive 1 dataset that contains all the longitudinal information for the predictions, i.e., you should not provide a list with a dataset for each specific outcomes. If the measurement times match for all the longitudinal outcomes (as it is often the case), then you just have one column for each outcome. If not, just add more lines and set to NA the columns for the outcomes that are not observed at corresponding measurement times (I illustrated this in the code, although the measurement times matched in my case).

I added the argument "newDataSurv" for the survival part (in the example, this survival dataset is only used to get the value of the covariate that is not included in the longitudinal models (covariate: "sex")), it's also possible to just have the column for that covariate in newData instead but I wanted to illustrate that feature if you need it.

Just a remark but I'm using lognormal for positive values as it is common in two-part models to have right skewed positive values and using a log link is able to correct it, beyond the fact that Gaussian would allow for negative values while the outcome is strictly positive.

Missing values for the outcomes is fine, it just doesn't contribute to the likelihood but you cannot have missing values in covariates as I am unable to guess these values. I think the default behavior will assume the reference when a value is missing. I'll add a warning for these situations.

Please check and let me know if it works for you.

Best, Denis Predict_twopartJoint.txt

bbb801 commented 5 months ago

Dear Dr Denis,

Thank you so much for your detailed tutorial. I encountered no errors when running your code. However, when I split the data into training and testing sets and then trained on the training data, I received a warning, such as, 'Max id is 312 but there are only 311 individuals with longitudinal records, I'm reassigning id from 1 to 311'. I'm unsure if this will affect the relationship between the ID and the data in the training and test sets, potentially causing issues. By the way, do you have suggestions about calculating performance or metrics that can be used based on your package's predicted results? For example, dynamic AUC.

image

Regards, Beato

DenisRustand commented 5 months ago

Hi Beato,

I need the id to have contiguous values between 1 and the number of groups, in your case it seems like one is missing. I automatically reassign it but I'd suggest to do it yourself to avoid any risk of id mismatches. So it you have individual id 1, 2 and 4, you need to assign ids 1, 2 and 3 to avoid this warning.

For the evaluation of the performance of the predictions, I have not implemented any metrics yet as I'm not done yet with the development of the predict function (for now, we can predict one or multiple survival from one or multiple longitudinal, I want more flexibility so we can predict survival from survival, longitudinal from survival, etc.). I don't know if I'll implement any predictive accuracy metrics actually, since the usual approaches (Brier score, AUC, EPOCE) can be applied and are likely already available in some other packages.

Best, Denis Rustand

bbb801 commented 5 months ago

Hi,

Please find attached a code that illustrate how to use the predict function for the model you want. Make sure you download the latest INLAjoint version from GitHub.

I included 2 semi-continuous outcomes and 1 survival, so you can see how it works, in particular the argument "newData" for the longitudinal part should only receive 1 dataset that contains all the longitudinal information for the predictions, i.e., you should not provide a list with a dataset for each specific outcomes. If the measurement times match for all the longitudinal outcomes (as it is often the case), then you just have one column for each outcome. If not, just add more lines and set to NA the columns for the outcomes that are not observed at corresponding measurement times (I illustrated this in the code, although the measurement times matched in my case).

I added the argument "newDataSurv" for the survival part (in the example, this survival dataset is only used to get the value of the covariate that is not included in the longitudinal models (covariate: "sex")), it's also possible to just have the column for that covariate in newData instead but I wanted to illustrate that feature if you need it.

Just a remark but I'm using lognormal for positive values as it is common in two-part models to have right skewed positive values and using a log link is able to correct it, beyond the fact that Gaussian would allow for negative values while the outcome is strictly positive.

Missing values for the outcomes is fine, it just doesn't contribute to the likelihood but you cannot have missing values in covariates as I am unable to guess these values. I think the default behavior will assume the reference when a value is missing. I'll add a warning for these situations.

Please check and let me know if it works for you.

Best, Denis Predict_twopartJoint.txt

Dear Denis,

I have a question and an error in your code. I dont understand the purpose of setting all longitudinal data be NA and then combine them by row.

(1) Is it to predict the future 14 steps (horizon=14) longitundinal data and survival outcome using the current longitudinal and survival data ?

ND1 <- Longi[Longi$id==8,]
ND2 <- Longi2[Longi2$id==8,]
ND1$serBilir <- NA
ND1$binary2 <- NA
ND2$albumin <- NA
ND2$binary <- NA
ND <- rbind(ND1, ND2) # newdata
NDS <- Surv[Surv$id==8,]

(2) Why shoud we use 'ND <- rbind(ND1, ND2)' instead of 'ND <- cbind(ND1, ND2)'?

The first case has 16 rows and there are non-missing longitudindal data. image

The second has 8 rows and all longitudinal data are missing values. image

When I used cbind, it seems I still had the same results. I am unsure if the two formats to make differences in predictions. When there are non-missing values in the prediction function 'newData', will the predict function use them and apply them in the predicted results or just ignore them?

Regards, Beato

DenisRustand commented 5 months ago

If your new dataset for prediction only contains NA for the outcomes, you'll get predictions for the average in the population conditional on covariates. If there are values, the predictions are conditional on these values.

As I tried to explain, I have used rbind to illustrate how to account for multiple markers measured at different time points. In my case, I suppose the 8 measurement time points (i.e., 0, 0.5092542, ...) are different between the two longitudinal markers, so it is not possible to set them on the same line as each line is associated to a unique time point. Also as I mentioned, in this example these 8 time points are the same for the two markers, so of course in this case you want to have only 8 lines instead of 16, this rbind was just to illustrate how to do when measurement times are not matching.

This is in the case you want to make prediction conditional on a new individual with 8 observations per marker. In case you want predictions for the average in the population conditional on covariates, then you don't have time points and observations, just need 1 line per scenario in the new dataset, not 8 lines. If you want to predict over your training dataset, just call predict over the model without any argument (P <- predict(M14)) and you'll get the linear predictor for each data point in the training dataset.

It depends on what you mean when you say you want to make predictions...

Best, Denis Rustand

bbb801 commented 5 months ago

Thank you, Dr Denis Rustand. Please correct me if it is wrong. 1. Based on your explanation and running the code, your prediction function is making predictions from time point 0 to the horizon (=14) regardless of the time points provided in the data. When I input longitudinal biomarkers with a few non-missing values across multiple time points plus its baseline data in the prediction function, the prediction function will use this information for prediction. 2. The prediction function should recognize the id and link information if it is already in the training set.

DenisRustand commented 5 months ago
  1. Correct, the prediction is computed in an interval [A, B] where the default for A is 0 but can be changed to any value with argument "Csurv" in the call of the predict function. B is defined by the argument "horizon". The number of points between A and B at which predictions are evaluated is defined by argument "NtimePoints", default is 50 equidistant points. You can define your own set of points with argument "timePoints" instead of using the default equidistant points. The predictions are conditional on baseline covariates and longitudinal biomarkers measurements from the new dataset provided in the call of the predict function (when provided, and when biomarker is NA, predictions are only conditional on baseline covariates).
  2. Yes, these informations come from the fitted model.