COMETS 1.3 Permit adjustment for categorical variables

steven-moore commented 6 years ago

Currently, categorical variables are not properly adjusted for--they are entered into the model as continuous variables. Models should distinguish between categorical and continuous models using a new column that will be added to the Varmap tab.

This change will also require a change to the Sample file (to be logged separately) and to the "Create Input" utility (it needs to add this column to the input file that it creates).

ewymathe commented 6 years ago

This has been implemented and pushed. Needs to be tested against Stata or SAS results though.

Here are my results:

modeldata<-getModelData(exmetabdata,colvars="age",adjvars="smk_grp",modelspec="Interactive",rowvars="_1_2_propanediol") calcCorr(modeldata,exmetabdata, "DPP") [1] "running adjusted" [1] "Detected categorical adjustments, creating dummy variables" smk_grp1 smk_grp2 smk_grp3 _1_2_propanediol age [1,] 1 0 0 -1.47841 66 [2,] 0 0 0 -0.63942 61 [3,] 0 0 0 2.74333 71 [4,] 1 0 0 -1.36492 65 [5,] 0 0 0 -0.03118 56 [6,] 0 0 0 -0.89698 59 cohort spec model outcomespec exposurespec corr n pvalue 1 DPP Interactive 1 age -0.01736411 1000 0.5839456 adjvars outcome_uid outcome exposure_uid exposure 1 smk_grp1 smk_grp2 smk_grp3 1 1 age Age at Entry

modeldata2=modeldata modeldata2$gdta$smk_grp=as.numeric( modeldata2$gdta$smk_grp) calcCorr(modeldata2,exmetabdata, "DPP") [1] "running adjusted" smk_grp _1_2_propanediol age [1,] 2 -1.47841 66 [2,] 1 -0.63942 61 [3,] 1 2.74333 71 [4,] 2 -1.36492 65 [5,] 1 -0.03118 56 [6,] 1 -0.89698 59 cohort spec model outcomespec exposurespec corr n pvalue 1 DPP Interactive 1 age -0.01798162 1000 0.5702531 adjvars outcome_uid outcome exposure_uid exposure 1 smk_grp 1 1 age Age at Entry

steven-moore commented 6 years ago

My results line up with model 1 (see below). I assume model2 was just for testing purposes?

The CORR Procedure

3 Partial Variables: smk_grp1 smk_grp2 smk_grp3
1 With Variables: _1_2_PROPANEDIOL
1 Variables: age

                                         Simple Statistics

Variable N Mean Std Dev Median Minimum Maximum Label

smk_grp1 1000 0.34000 0.47395 0 0 1.00000
smk_grp2 1000 0.05900 0.23574 0 0 1.00000
smk_grp3 1000 0.02500 0.15620 0 0 1.00000
_1_2_PROPANEDIOL 1000 0.25360 1.14499 -0.02409 -1.99878 5.00809
age 1000 63.20800 5.37807 63.00000 55.00000 74.00000 age

Spearman Partial Correlation Coefficients, N = 1000 Prob > |r| under H0: Partial Rho=0

age

_1_2_PROPANEDIOL -0.01736 0.5839

steven-moore commented 6 years ago

Have you tested yet when running all metabolites? Or when performing multiple models? Just curious about how the model is performing at scale.

ewymathe commented 6 years ago

Results attached for model above run on all metabolites (exposure is age, adjustment is smk_grp) and for Model "2 Multivariable adjusted".

Model_2_MultivariableAdjusted.csv2018-01-24.xlsx AllMets_Adjsmk_Age.csv2018-01-24.xlsx

steven-moore commented 6 years ago

I will spot check

steven-moore commented 6 years ago

My values are still a hair different, so something is up.

Ewy, are you able to work with SAS files? If so, would you be willing to check using the same data inputs that I am using? (see attachments) corr_check.zip

steven-moore commented 6 years ago

FYI, the files above are adjusted just for the three different smoking groups. I figured it's better to start simple.

ewymathe commented 6 years ago

Sorry Steve but I can't work with SAS files. How different are the values?

steven-moore commented 6 years ago

Hi Ewy,

We're getting errors in modeling of categorical variables. From my latest round of batch mode files (see age 2.1 and BMI 1.1 at https://cbiit-tools-results-dev.s3.amazonaws.com/comets/results/1519405529.zip?Signature=b0gFmnfVCA10hiEH38u0u19h1Qg%3D&Expires=1520011193&AWSAccessKeyId=AKIAIIFG2CUDIUGRJ54Q as examples), it appears that some of the categorical variables are being dropped inappropriately. And others, like fasting, are not being dropped when they should.

Testing in interactive mode confirms the same issues. I've included a screenshot in interactive mode below that shows the mismatch between what is going in the model, and what is being adjusted for in the final results. "Alc grp" for example should be included in the adjustment, but fasting (which has the same value for all participants in this study) should not.

To me, this seems like a counting error when marching through an array. In other words, you have an algorithm that counts the number of variables to determine what to drop, but that algorithm is not accounting for the fact that we've added dummy variables. So the wrong variables end up being being dropped.

This issue did not register before because I was using the wrong column name ("Coding" vs "VARTYPE") in my varmap tab.

I will post this message in Github for our records. The input file is also attached below

cometsInput_March_2018.xlsx

steven-moore commented 6 years ago

The issue above supersedes the need to compare SAS vs. R. In fact, using interactive mode, it will be relatively easy for me to make comparisons between SAS and R, so I can do this without swapping files with you.

ewymathe commented 6 years ago

I fixed this and pushed onto master. Could you please check?

ewymathe commented 6 years ago

Please wait. Found another bug upon further testing. Will prioritize in the morning.

steven-moore commented 6 years ago

Got it. Keep me posted😊

S

From: Ewy Mathe [mailto:notifications@github.com] Sent: Sunday, February 25, 2018 11:16 PM To: CBIIT/R-cometsAnalytics R-cometsAnalytics@noreply.github.com Cc: Moore, Steve (NIH/NCI) [E] steve.moore@nih.gov; Author author@noreply.github.com Subject: Re: [CBIIT/R-cometsAnalytics] COMETS 1.3 Permit adjustment for categorical variables (#19)

Please wait. Found another bug upon further testing. Will prioritize in the morning.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-368385682, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AhG9TaSdLaPl0FATaXgmxTR-alWXdhF7ks5tYjAGgaJpZM4RIwPe.

ewymathe commented 6 years ago

OK, just pushed and looks good. Steve, please see here for the output of BMI 1.1 and let me know if things look correct.

BMI.1.1_results.xlsx

steven-moore commented 6 years ago

The results are more aligned but still not perfect. Let's discuss tomorrow after the demo. Could be a SAS vs. R issue, and is probably worth testing basic models in a systematic fashion on both platforms (without involving COMETS-analytics).

I don't want to have to go back to all of the cohorts after the fact because of a minor error in adjustments.

ewymathe commented 6 years ago

Sounds reasonable!

2018-02-26 14:52 GMT-05:00 Steven Moore notifications@github.com:

The results are more aligned but still not perfect. Let's discuss tomorrow after the demo. Could be a SAS vs. R issue, and is probably worth testing basic models in a systematic fashion on both platforms (without involving COMETS-analytics).

I don't want to have to go back to all of the cohorts after the fact because of a minor error in adjustments.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-368627861, or mute the thread https://github.com/notifications/unsubscribe-auth/AHowQrieiwjIek9Lck8kdAYTl-dezzqyks5tYwtfgaJpZM4RIwPe .

steven-moore commented 6 years ago

For Ella to verify. I have run three simple models in COMETS-analytics using different adjustments in (see link below), and need someone to verify that dummy coding is being handled properly. I have checked models in SAS using my own hand-coding of dummy variables. The resulting correlations and p-values are very close, but not perfectly identical. Unfortunately, this could just be a SAS vs. R issue.

Ella, would you be able to hand code the dummy variables called for in the adjustments and run the models in the spreadsheet? Kindly document the values in the attached spreadsheet, reupload, and if values match, close the issue.

COMETS-analytics_v_direct_r_code.xlsx

ellatemprosa commented 6 years ago

just confirming, you are using the corr_test.zip

steven-moore commented 6 years ago

I am using the comets-analytics-test.org website directly.

S

From: ellatemprosa [mailto:notifications@github.com] Sent: Tuesday, February 27, 2018 3:54 PM To: CBIIT/R-cometsAnalytics R-cometsAnalytics@noreply.github.com Cc: Moore, Steve (NIH/NCI) [E] steve.moore@nih.gov; Author author@noreply.github.com Subject: Re: [CBIIT/R-cometsAnalytics] COMETS 1.3 Permit adjustment for categorical variables (#19)

just confirming, you are using the corr_test.zip

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-369022337, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AhG9TQl7uQyd25RvcRN6c0DwuMERfHotks5tZGtygaJpZM4RIwPe.

ellatemprosa commented 6 years ago

sorry wanted to confirm which input file, the zip or the cometsInput_March_2018.xlsx

steven-moore commented 6 years ago

cometsInput_March_2018.xlsx

From: ellatemprosa [mailto:notifications@github.com] Sent: Tuesday, February 27, 2018 3:59 PM To: CBIIT/R-cometsAnalytics R-cometsAnalytics@noreply.github.com Cc: Moore, Steve (NIH/NCI) [E] steve.moore@nih.gov; Author author@noreply.github.com Subject: Re: [CBIIT/R-cometsAnalytics] COMETS 1.3 Permit adjustment for categorical variables (#19)

sorry wanted to confirm which input file, the zip or the cometsInput_March_2018.xlsx

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-369023912, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AhG9TabDVyWgsWnaQ1jmjWHVlZHdwMheks5tZGyvgaJpZM4RIwPe.

ellatemprosa commented 6 years ago

also related to issue 32 https://github.com/CBIIT/R-cometsAnalytics/issues/32

steven-moore commented 6 years ago

To my understanding, the numbers are now lining up. Is this correct Ella? If yes, could you close this issue? I don't want to close without knowing for sure.

ellatemprosa commented 6 years ago

if you look at the example we are all testing focusing on bmi_grp=2 strata with outcome _1_2_dipalmitoylglycerol, exposure age and adjustment for race_grp

here's the ppcor result which is the same in sas

running all strata works with correlation exact but p-value still a tiny little off

but if you run restricted to strata = 2, we get missing results

but for bmi_grp = 3 it is good

for expanded p-value

ewymathe commented 6 years ago

OK, I'm using the psych::corr.p function because it's part of the psych package from which we're calculating the partial correlation matrix... will change to pcorr

steven-moore commented 6 years ago

Fingers crossed

ellatemprosa commented 6 years ago

Yes, use that package, see my markdown file, code is there

ewymathe commented 6 years ago

I've replaced with pcor. Please note though that now, the code is going to run very slow bc this function is just as expensive as cor.test (which is why we ended up implementing it "by hand"). I still think it's worth going back to the previous code using the psych package, and figuring out how to calculate the p-value appropriately.
pcor does solve all our issues though (including the NA issue when only looking at bmi-grp strata 2). Results are pushed to GitHub.

ewymathe commented 6 years ago

One more thing: pcor fails if two adjustment variables have a correlation of 1 (which makes sense). I'd like to include a check for adjusted variables and remove variables that have a correlation 1 with another variable. It would throw a warning. I just want to check and make sure that is statistically sound? This example can happen when you stratify...Specifically, in the example data of the package, it happens for model 2.2 BMI stratified, stratification group 2. Let me know what you think.

steven-moore commented 6 years ago

We run the metabolites against each other, so not sure how this would work if remove variables.

When you say fail, do you mean crash?

S

From: Ewy Mathe [mailto:notifications@github.com] Sent: Wednesday, March 21, 2018 10:00 PM To: CBIIT/R-cometsAnalytics R-cometsAnalytics@noreply.github.com Cc: Moore, Steve (NIH/NCI) [E] steve.moore@nih.gov; State change state_change@noreply.github.com Subject: Re: [CBIIT/R-cometsAnalytics] COMETS 1.3 Permit adjustment for categorical variables (#19)

One more thing: pcor fails if two adjustment variables have a correlation of 1 (which makes sense). I'd like to include a check for adjusted variables and remove variables that have a correlation 1 with another variable. It would throw a warning. I just want to check and make sure that is statistically sound? This example can happen when you stratify...Specifically, in the example data of the package, it happens for model 2.2 BMI stratified, stratification group 2. Let me know what you think.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHubhttps://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-375153615, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AhG9TWRrdGPXIefXUGbSFPNNGGOf4FXhks5tgwWNgaJpZM4RIwPe.

steven-moore commented 6 years ago

In batch mode, some of the models are not working. You can verify this by running Age 2.2, for example, which will return the message of

system is computationally singular: reciprocal condition number = 2.27816e-36

Also, see below:

Alternately, in Super Batch mode, you get the messages below. The last two messages I assume are related to singularity. Also, the BMI-strata warning is wrong here--there is more than one level of BMI.

Age.2.2 BMI stratified - Complete Warnings:

Warning: one of your models specifies female as an adjustment but that variable only has one possible value. Model will run without female adjusted.
Warning: one of your models specifies fasted as an adjustment but that variable only has one possible value. Model will run without fasted adjusted.
Warning: one of your models specifies study_site as an adjustment but that variable only has one possible value. Model will run without study_site adjusted.
Warning: one of your models specifies bmi_grp as a stratification but that variable only has one possible value. Model will run without bmi_grp stratified
The inverse of variance-covariance matrix is calculated using Moore-Penrose generalized matrix invers due to its determinant of zero.
NaNs produced

steven-moore commented 6 years ago

We are close. All of the models run except for models stratified by horm_curr. This particular bug occurs, I think, because some of the race_grp values only have one horm_curr value, causing the horm_curr dummy variables to be dropped. Here is a simple version of the model:

EXPOSURE: Age OUTCOME: All metabolites ADJUSTED COVARIATES: race_grp STRATA: horm_curr

This returns the following error message: supply both 'x' and 'y' or a matrix-like 'x'

Thinking this through has made me realize that a race_grp variable may have only one horm_curr value, but each horm_curr dummy variable could still have multiple race_grp values. We've been assuming an equivalence in the variables--that we can just pick one at random to drop--but this may not actually be the case.

Also, should the variable that is being stratified upon be part of the list of dummy variables being evaluated for correlation?

If we can answer these two questions and implement correctly, this issue will finally be solved.

kailingchen commented 6 years ago

Got the same error with this model EXPOSURE: Age OUTCOME: All metabolites ADJUSTED COVARIATES: bmi

ellatemprosa commented 6 years ago

@steven-moore i tested via command line your complicated example run for EXPOSURE: Age OUTCOME: All metabolites ADJUSTED COVARIATES: race_grp STRATA: horm_curr

and the algorithm works!!!

ellatemprosa commented 6 years ago

glad the solution using caret is doing well. this should be well documented on the paper Caret has some nice functions for handling some of the issues we are dealing with https://topepo.github.io/caret/pre-processing.html#dummy

from cox.ph

singular.ok | logical value indicating how to handle collinearity in the model matrix. If TRUE, the program will automatically skip over columns of the X matrix that are linear combinations of earlier columns. In this case the coefficients for such columns will be NA, and the variance matrix will contain zeros. For ancillary calculations, such as the linear predictor, the missing coefficients are treated as zeros.

ellatemprosa commented 6 years ago

i will now test whether we are getting the right results using raw code

steven-moore commented 6 years ago

Thank you Ella for the updates. We have one key issue remaining but I am optimistic that the fix will be easy. When any covariate in the model has only one value (in the test dataset, this includes "female" and "fasting"), and is combined with any other covariate, the model does not run, and instead returns the error below:

I believe this occurs because the CARET package is being called before covariates with only one value are excluded. So, for example, when the CARET package tries to evaluate singularity, etc. for "female" relative to other covariates, it fails because "female" has only one value.

The solution should be simple--exclude these covariates before calling the CARET package.

ellatemprosa commented 6 years ago

the fix i made seem to make this go away it just comes out with warning Warning: one of your models specifies fasted as an adjustment value but that variable only has one possible value. Model will run without fasted as an adjustment

ellatemprosa commented 6 years ago

The <15 was called twice so i took that out

ellatemprosa commented 6 years ago

i tested this scenario with updated code and it seems ok now in this commit https://github.com/CBIIT/R-cometsAnalytics/commit/c35674a744ef6025c5f6d6f9c01a749bbc367b37

steven-moore commented 6 years ago

I have completed testing and everything is working great. The app is correctly applying dynamic changes to the dummy variables used for adjustment. This includes stratified analyses.

So for example, in analyses of age-metabolite associations among those with heart-disease, the model is excluding educ_grp4, alc_grp3, multivitamin2, and horm_curr1 as dummy variables. In this small subset of participants (N=133), each of these variables is a linear combination of other covariates and is highly (or perfectly) collinear, and therefore should be excluded from adjustments.

To my understanding, the threshold for correlation used to make this determination is 0.95--which seems like a reasonable choice. Ella will provide further documentation of this.

I have also verified all of these calculations against SAS. Using the same sets of dummy variables (which are now shown in the output spreadsheet, in the adjspec column), I found 100% correspondence in r values and p-values between SAS and COMETS-analytics.

Ewy and Ella: amazing work!

ewymathe commented 6 years ago

So glad that all is working! And yes, confirming that the threshold for correlation is set to 0.95. It is hard-coded for now but we could make this a parameter that users can input if we want. Ewy

2018-04-23 15:06 GMT-04:00 Steven Moore notifications@github.com:

I have completed testing and everything is working great. The app is correctly applying dynamic changes to the dummy variables used for adjustment. This includes stratified analyses.

So for example, in analyses of age-metabolite associations among those with heart-disease, the model is excluding educ_grp4, alc_grp3, multivitamin2, and horm_curr1 as dummy variables. In this small subset of participants (N=133), each of these variables is a linear combination of other covariates and is highly (or perfectly) collinear, and therefore should be excluded from adjustments.

To my understanding, the threshold for correlation used to make this determination is 0.95--which seems like a reasonable choice. Ella will provide further documentation of this.

I have also verified all of these calculations against SAS. Using the same sets of dummy variables (which are now shown in the output spreadsheet, in the adjspec column), I found 100% correspondence in r values and p-values between SAS and COMETS-analytics.

Ewy and Ella: amazing work!

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/CBIIT/R-cometsAnalytics/issues/19#issuecomment-383687729, or mute the thread https://github.com/notifications/unsubscribe-auth/AHowQkrYs77avBQILrmvXjJCVcTuhytKks5triY3gaJpZM4RIwPe .

CBIIT / R-cometsAnalytics

COMETS 1.3 Permit adjustment for categorical variables #19