CecileProust-Lima / lcmm

R package lcmm
https://CecileProust-Lima.github.io/lcmm/

Missing data and validating class membership in new data set #93

Closed kritchaiv closed 2 years ago

kritchaiv commented 3 years ago

Dear Dr Proust-Lima,

I hope you are well and many thanks for writing this package. I found it to be really fascinating and useful. However, there are some questions that I cannot find the answers to online and hope that you will be able to shed some light on this.

I'm working on a data set of 400 patients in ITU. Each day, a SOFA score was calculated based on their clinical status. Patients were followed up for 28 days. However, if they were discharged or passed away, the recording stopped. I have followed the example from this page: https://cran.r-project.org/web/packages/lcmm/vignettes/latent_class_model_with_hlme.html

My code is:

m3 <- gridsearch(hlme(SOFA_score ~ Date, random =~ Date, subject = 'ID', data=data.set, ng = 3, mixture=~Date), rep=100, maxiter=30, minit=m1)

My questions are:

  1. From one of the past questions, you mentioned that missing data were treated as MAR. How does this impact the overall trajectory of the class? Since many of my subjects left ITU by Day 10, will it make more sense to limit my timeline to Day 10 instead of the full 28?

  2. Once I have done an analysis and obtained class membership on an existing dataset, is it possible to use this to predict the class membership from a different dataset that uses a similar parameter? For instance, there is another database that tracks SOFA scores among patients in ITU. Is there a way to apply the model derived from this existing database to another to validate it?

I would really appreciate your help on this.

Thank you,

Best wishes, Jay

Edit: questions were modified as I found some answers to the previous questions.

CecileProust-Lima commented 3 years ago

Hi, sorry for the delay. To respond to your questions:

  1. Missing data: MAR means that the dropout can be predicted from the observations. If you think this assumption may be plausible, then your analyses will be robust to the missing data mechanism. You may want to censor information after Day 10 if you wish but this is not necessary I think.
  2. There is a function for this called predictClass. It is available in versions >= 1.9.3

I hope it helps.

Cécile
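For readers who want to try the second point, here is a minimal sketch of how predictClass might be used to assign subjects from a separate data set to the classes of an already fitted model. The data are simulated, and the names simulate_scores, train, valid and SOFA are hypothetical stand-ins rather than anything from the thread. The new data set is assumed to contain the same variables (outcome, time and subject identifier) as the one used for fitting.

```r
library(lcmm)

set.seed(1)

# Simulate a toy longitudinal data set: n subjects, daily scores over 10 days,
# two latent groups with different slopes (hypothetical helper and values).
simulate_scores <- function(n, id_offset = 0) {
  group <- rbinom(n, 1, 0.5)
  do.call(rbind, lapply(seq_len(n), function(i) {
    day <- 0:9
    slope <- if (group[i] == 1) -0.8 else 0.3
    data.frame(ID = id_offset + i, Date = day,
               SOFA = 10 + rnorm(1) + slope * day + rnorm(length(day)))
  }))
}

train <- simulate_scores(100)                    # data used to fit the model
valid <- simulate_scores(40, id_offset = 1000)   # separate "validation" data

# Fit the one-class model first, then use it to initialise the two-class model.
m1 <- hlme(SOFA ~ Date, random = ~ Date, subject = "ID", data = train, ng = 1)
m2 <- hlme(SOFA ~ Date, random = ~ Date, subject = "ID", data = train,
           ng = 2, mixture = ~ Date, B = m1)

# Posterior class membership for subjects that were not used in the fit.
pred <- predictClass(m2, newdata = valid)
head(pred)
```

The result has one row per subject in the new data, with the assigned class and the posterior class-membership probabilities. Comparing those probabilities with the ones obtained on the original data is one possible way to judge how well the classes carry over.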

kritchaiv commented 3 years ago

Dear Cécile,

Thank you so much for your help. I really appreciate it.

Please may I ask for your help for the following as well?

  1. I'm now trying to build a multivariate model using the multlcmm function. It works brilliantly for the ng=1 model, and for ng=2-4 when I do not use the gridsearch function. Because m1 took 5 hours to run, I moved to my university's cluster computer so I could use gridsearch (rep=100, maxiter=30). However, the resulting models for ng=2-4 each showed a single class containing 100% of the subjects, and their BIC values became 2x10^9.

This is the code I used to generate m1: m1all.beta <- multlcmm(Total + Resp

For subsequent gridsearch functions, I used: m2all.beta <- gridsearch(multlcmm(Total + Resp + Coag + Liver + Cardio + Renal ~ Date, random =~ Date, subject = 'ID', data=SOFA_All, ng = 2, mixture=~Date, link = 'beta', randomY = TRUE, cor = BM(Date)), rep=100, maxiter=30, minit=m1all.beta)

I tried removing 'randomY = TRUE, cor = BM(Date)' but the results were the same.

  2. I've also tried to apply this analysis to different parameters. Instead of SOFA scores, I'm using it on some metabolites of interest over time. When I run the hlme function (below), it works perfectly fine.

m1lactic.acid.linear <- hlme(Lactic.acid ~ Date,random =~ Date, subject = 'ID', data = VANISH_NMR)

However, when I used lcmm so I could include the link function beta, the following error appeared:

m1lactic.acid.beta <- lcmm(Lactic.acid ~ Date, random=~ Date, subject='ID', data=VANISH_NMR, link='linear')

Error in str2lang(x) : <text>:1:10: unexpected symbol
1: ~ Sample ID
            ^

I had ensured that there were no symbols in the 'ID' column but the problem persisted.

Your advice will be much appreciated.

Many thanks in advance for your kind help.

Best wishes, Jay


VivianePhilipps commented 3 years ago

Hello,

For your issue with the gridsearch function, maybe the installations differ between your computer and the cluster. Do they use the same version of R and the same version of lcmm? When running a 2-class model without gridsearch on the cluster, do you get the same result?

For the error on lcmm, I cannot see where it happens. Could you run the traceback() command just after the error occurs? This will help to see which line of code causes the error, as we do not explicitly use the str2lang function.

Best,

Viviane

kritchaiv commented 3 years ago

Dear Viviane,

Many thanks for your reply and my apologies for the late response.

Yes, it is the same installation and version of R and lcmm. There is no issue when running a 2-class model without gridsearch. I can go up to 5 classes with no problems. The issue seems to arise only when I use the gridsearch function.

As for the error, it was apparently due to the data being stored as a list, and the issue was fixed once the data were unlisted.

Many thanks for your kind help.

Best wishes, Kritchai
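The "unlisting" fix mentioned above is not shown in the thread. One plausible reading is that a column of the data frame was stored as a list rather than as an atomic vector. The sketch below uses only base R and a hypothetical toy data frame called nmr (not the VANISH_NMR data) to show how such a column can be detected and flattened before the data are passed to a model-fitting function.

```r
# A data frame whose ID column is accidentally a list (hypothetical toy data).
nmr <- data.frame(Date = c(0, 1, 0, 1), Lactic.acid = c(2.1, 1.8, 3.0, 2.7))
nmr$ID <- list(1, 1, 2, 2)

is.list(nmr$ID)  # the column is a list, not an atomic vector

# Flatten every list column, but only if each cell holds exactly one value,
# so that unlist() cannot silently change the number of rows.
list_cols <- vapply(nmr, is.list, logical(1))
for (col in names(nmr)[list_cols]) {
  stopifnot(all(lengths(nmr[[col]]) == 1))
  nmr[[col]] <- unlist(nmr[[col]])
}

str(nmr)
```

Running str() on the data before fitting is a cheap way to spot list columns, which often appear after importing from spreadsheets or after certain apply-style operations.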


kritchaiv commented 3 years ago

Dear Viviane,

I hope you are well.

I tried running the gridsearch again and it produced the following message:

1: In rmvnorm(n = 1, mean = bb, sigma = vbb) : sigma is numerically not positive semidefinite

Please let me know if this helps narrow down the issue.

Once again, thank you so much for your kind help.

Best wishes, Jay


VivianePhilipps commented 3 years ago

Hi,

It looks like you have negative variances in the initial model. Did you check the convergence of m1all.beta? If it converged (i.e. m1all.beta$conv == 1), could you please look at VarCov(m1all.beta) to see whether there are any negative or infinite values?

Viviane
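The check suggested above can be sketched in base R. The helper check_varcov below is hypothetical (it is not part of lcmm). It takes any square variance-covariance matrix, such as the one returned by VarCov() on a fitted model, and reports whether it contains non-finite values, negative variances on the diagonal, or a negative eigenvalue, which is the condition behind the "not positive semidefinite" message from rmvnorm quoted earlier in this thread.

```r
# Hypothetical helper: basic sanity checks on a variance-covariance matrix.
check_varcov <- function(V, tol = 1e-8) {
  V <- as.matrix(V)
  all_finite <- all(is.finite(V))
  if (!all_finite) {
    # Eigenvalues cannot be computed reliably with NA/Inf entries.
    return(list(all_finite = FALSE, negative_variance = NA,
                min_eigenvalue = NA_real_, psd = FALSE))
  }
  # Symmetrise to remove tiny numerical asymmetries before eigen().
  eig <- eigen((V + t(V)) / 2, symmetric = TRUE, only.values = TRUE)$values
  list(all_finite = TRUE,
       negative_variance = any(diag(V) < 0),
       min_eigenvalue = min(eig),
       psd = min(eig) >= -tol * max(abs(eig)))
}

# A valid covariance matrix and one that is not positive semidefinite.
good <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
bad  <- matrix(c(1, 2, 2, 1), nrow = 2)

check_varcov(good)$psd
check_varcov(bad)$psd

# With a fitted model (assumption: `m1all.beta` exists in the session):
# check_varcov(VarCov(m1all.beta))
```

Note that negative off-diagonal entries are ordinary covariances and are not a problem in themselves. What matters is the diagonal and the eigenvalues, together with the convergence status of the model.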

kritchaiv commented 3 years ago

Dear Viviane,

Many thanks for your reply.

The conv for the model is 2 and the VarCov function shows some negative values as seen below:

                    Date   cholesky 1    cholesky 2     std.err 1
Date          7254.26990  -260.752726    -999.87589 -3.785068e+02
cholesky 1    -260.75273  6781.547334   -1045.86079 -3.642369e+01
cholesky 2    -999.87589 -1045.860786    8571.74560 -1.512541e+03
std.err 1     -378.50675   -36.423692   -1512.54072  2.368947e+04
std.err 2      -18.88546    13.484646     -66.95903 -2.372858e+02
std.err 3       86.78468    -8.499362      88.70477  1.876207e+02
std.err 4       64.59874    13.679482      53.14490  1.701954e+02
std.err 5     -156.20202   -65.002930    -186.07852 -1.770056e+02
std.err 6       18.74808    -4.426999      10.32540 -3.626622e+01
Beta1          104.87136   165.145933    -509.81632  4.740792e+03
Beta2         -577.38969   174.074581    -836.92874  6.476366e+03
Beta3         2234.66560   167.387506     758.94102  2.619362e+02
Beta4       -13305.25242  4345.051843  -21124.78891  1.496245e+05

Does this mean that the initial model needs more optimisation?

On another note, I have run a few lcmm models and noticed that, with a higher number of classes, some classes have 0% membership. From reading the FAQ, it seems that this is a local maximum and the model has converged towards a model with fewer classes. However, these models with 0% membership sometimes have a lower AIC/BIC value than the models with fewer classes. In this situation, should we ignore the AIC/BIC value for such models?

Thank you so much for your advice.


VivianePhilipps commented 3 years ago

Yes, you should run the one-class model again with more iterations (increase maxiter in the model call). The gridsearch function indeed requires the initial model to have converged; I will add this check to the code.
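As an illustration of that workflow, the sketch below fits a one-class model with a larger iteration budget and only calls gridsearch once the convergence code equals 1. It uses hlme on simulated data to stay small and fast. The data set toy, the outcome Score and the values of maxiter and rep are hypothetical choices, not recommendations for the multlcmm model discussed in this thread.

```r
library(lcmm)

set.seed(2)

# Toy data standing in for the real cohort (hypothetical names and values).
n <- 60
toy <- do.call(rbind, lapply(seq_len(n), function(i) {
  day <- 0:7
  data.frame(ID = i, Date = day,
             Score = 8 + rnorm(1) + (0.2 + rnorm(1, sd = 0.3)) * day +
               rnorm(length(day)))
}))

# One-class model with a generous iteration budget.
m1 <- hlme(Score ~ Date, random = ~ Date, subject = "ID", data = toy,
           ng = 1, maxiter = 500)

# conv == 1 means the convergence criteria were met; any other value means the
# estimates should not be used as the starting point for gridsearch().
if (m1$conv != 1) {
  stop("One-class model did not converge (conv = ", m1$conv,
       "); refit it before calling gridsearch().")
}

# Grid search for the two-class model, initialised from the converged m1.
m2 <- gridsearch(hlme(Score ~ Date, random = ~ Date, subject = "ID",
                      data = toy, ng = 2, mixture = ~ Date),
                 rep = 5, maxiter = 10, minit = m1)
m2$conv
```

Stopping early when the initial model has not converged avoids spending hours of cluster time on a grid search whose random starting values are drawn from an unusable covariance matrix.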

For your second point, say you wanted to estimate a 4-class model and you ended up with an empty class, so you actually reached a 3-class model (a local maximum). The AIC and BIC criteria are still computed as if it were a 4-class model, so they are not relevant in this particular situation.

Viviane
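To spot this situation before comparing information criteria, one option is to tabulate the posterior classification. The sketch below is base R only and uses a hypothetical helper, class_shares, applied to a toy table shaped like the pprob element of a fitted lcmm model (a subject identifier plus a class column). With a real fit, the same function could be applied to the model's own pprob table, and postprob() or summarytable() give related summaries.

```r
# Hypothetical helper: share of subjects assigned to each of the ng classes,
# including classes that received nobody.
class_shares <- function(pprob, ng) {
  counts <- tabulate(pprob$class, nbins = ng)
  setNames(counts / nrow(pprob), paste0("class", seq_len(ng)))
}

# Toy posterior table: a 4-class fit in which nobody is assigned to class 3.
pprob <- data.frame(ID = 1:6, class = c(1, 1, 2, 4, 4, 2))

shares <- class_shares(pprob, ng = 4)
empty <- names(shares)[shares == 0]

if (length(empty) > 0) {
  message("Empty classes: ", paste(empty, collapse = ", "),
          "; the AIC/BIC of this fit should not be compared with other models.")
}
```

A fit flagged this way is effectively a model with fewer classes, so it is usually better to rerun it from other starting values than to interpret its criteria.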