Closed: steven-moore closed this issue 6 years ago.
This has been implemented and pushed. Needs to be tested against Stata or SAS results though.
Here are my results:
```r
modeldata <- getModelData(exmetabdata, colvars = "age", adjvars = "smk_grp",
                          modelspec = "Interactive", rowvars = "_1_2_propanediol")
calcCorr(modeldata, exmetabdata, "DPP")
```

```
[1] "running adjusted"
[1] "Detected categorical adjustments, creating dummy variables"
     smk_grp1 smk_grp2 smk_grp3 _1_2_propanediol age
[1,]        1        0        0         -1.47841  66
[2,]        0        0        0         -0.63942  61
[3,]        0        0        0          2.74333  71
[4,]        1        0        0         -1.36492  65
[5,]        0        0        0         -0.03118  56
[6,]        0        0        0         -0.89698  59
  cohort        spec model outcomespec exposurespec        corr    n    pvalue
1    DPP Interactive                 1          age -0.01736411 1000 0.5839456
                     adjvars outcome_uid outcome exposure_uid     exposure
1 smk_grp1 smk_grp2 smk_grp3           1       1          age Age at Entry
```
```r
modeldata2 <- modeldata
modeldata2$gdta$smk_grp <- as.numeric(modeldata2$gdta$smk_grp)
calcCorr(modeldata2, exmetabdata, "DPP")
```

```
[1] "running adjusted"
     smk_grp _1_2_propanediol age
[1,]       2         -1.47841  66
[2,]       1         -0.63942  61
[3,]       1          2.74333  71
[4,]       2         -1.36492  65
[5,]       1         -0.03118  56
[6,]       1         -0.89698  59
  cohort        spec model outcomespec exposurespec        corr    n    pvalue
1    DPP Interactive                 1          age -0.01798162 1000 0.5702531
  adjvars outcome_uid outcome exposure_uid     exposure
1 smk_grp           1       1          age Age at Entry
```
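As an aside, the dummy expansion that calcCorr reports ("Detected categorical adjustments, creating dummy variables") can be sketched with base R's `model.matrix()`. This is an illustration of the mechanism only, not the package's actual code, and the package's dummy naming may differ:

```r
# Illustrative sketch (not calcCorr's internals): expand a categorical
# adjustment into 0/1 dummy columns via model.matrix().
smk_grp <- factor(c(2, 1, 1, 2, 1, 1))

# model.matrix() drops the reference level (here "1"), leaving one dummy
# column per remaining level.
dummies <- model.matrix(~ smk_grp)[, -1, drop = FALSE]
dummies[, 1]  # 1 0 0 1 0 0 -- matches the first dummy column above
```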
My results line up with model 1 (see below). I assume model 2 was just for testing purposes?
```
The CORR Procedure

3 Partial Variables: smk_grp1 smk_grp2 smk_grp3
1 With Variables:    _1_2_PROPANEDIOL
1 Variables:         age

                          Simple Statistics
Variable              N      Mean   Std Dev    Median   Minimum   Maximum  Label
smk_grp1           1000   0.34000   0.47395         0         0   1.00000
smk_grp2           1000   0.05900   0.23574         0         0   1.00000
smk_grp3           1000   0.02500   0.15620         0         0   1.00000
_1_2_PROPANEDIOL   1000   0.25360   1.14499  -0.02409  -1.99878   5.00809
age                1000  63.20800   5.37807  63.00000  55.00000  74.00000  age

Spearman Partial Correlation Coefficients, N = 1000
Prob > |r| under H0: Partial Rho=0

                       age
_1_2_PROPANEDIOL  -0.01736
                    0.5839
```
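For reference, a Spearman partial correlation like the one PROC CORR computes above can be reproduced by hand in base R: rank-transform, residualize on the adjustment dummies, then correlate the residuals. This is a toy-data sketch with random data and illustrative names, not the COMETS code:

```r
# Toy-data sketch of a Spearman partial correlation: rank, residualize on
# the adjustment dummies, correlate the residuals, then use a t statistic
# with n - 2 - k degrees of freedom (k = number of adjustment columns).
set.seed(1)
n   <- 100
adj <- cbind(smk1 = rbinom(n, 1, 0.34), smk2 = rbinom(n, 1, 0.06))
x   <- rnorm(n)   # stands in for a metabolite
y   <- rnorm(n)   # stands in for age

ex  <- resid(lm(rank(x) ~ adj))
ey  <- resid(lm(rank(y) ~ adj))
rho <- cor(ex, ey)

df <- n - 2 - ncol(adj)
t  <- rho * sqrt(df / (1 - rho^2))
p  <- 2 * pt(-abs(t), df)
c(rho = rho, p = p)
```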
Have you tested running all metabolites yet? Or performing multiple models? Just curious how the model performs at scale.
Results attached for model above run on all metabolites (exposure is age, adjustment is smk_grp) and for Model "2 Multivariable adjusted".
Model_2_MultivariableAdjusted.csv2018-01-24.xlsx AllMets_Adjsmk_Age.csv2018-01-24.xlsx
I will spot check.
My values are still a hair different, so something is up.
Ewy, are you able to work with SAS files? If so, would you be willing to check using the same data inputs that I am using? (see attachments) corr_check.zip
FYI, the files above are adjusted just for the three different smoking groups. I figured it's better to start simple.
Sorry Steve but I can't work with SAS files. How different are the values?
Hi Ewy,
We're getting errors in modeling of categorical variables. From my latest round of batch mode files (see age 2.1 and BMI 1.1 at https://cbiit-tools-results-dev.s3.amazonaws.com/comets/results/1519405529.zip?Signature=b0gFmnfVCA10hiEH38u0u19h1Qg%3D&Expires=1520011193&AWSAccessKeyId=AKIAIIFG2CUDIUGRJ54Q as examples), it appears that some of the categorical variables are being dropped inappropriately, while others, like fasting, are not being dropped when they should be.
Testing in interactive mode confirms the same issues. I've included a screenshot from interactive mode below that shows the mismatch between what is going into the model and what is being adjusted for in the final results. "Alc grp", for example, should be included in the adjustment, but fasting (which has the same value for all participants in this study) should not.
To me, this seems like a counting error when marching through an array. In other words, you have an algorithm that counts the number of variables to determine what to drop, but that algorithm is not accounting for the fact that we've added dummy variables. So the wrong variables end up being dropped.
This issue did not register before because I was using the wrong column name ("Coding" vs "VARTYPE") in my varmap tab.
I will post this message in Github for our records. The input file is also attached below
The issue above supersedes the need to compare SAS vs. R. In fact, using interactive mode, it will be relatively easy for me to make comparisons between SAS and R, so I can do this without swapping files with you.
I fixed this and pushed onto master. Could you please check?
Please wait. Found another bug upon further testing. Will prioritize in the morning.
Got it. Keep me posted😊
S
OK, just pushed and looks good. Steve, please see here for the output of BMI 1.1 and let me know if things look correct.
The results are more aligned but still not perfect. Let's discuss tomorrow after the demo. Could be a SAS vs. R issue, and is probably worth testing basic models in a systematic fashion on both platforms (without involving COMETS-analytics).
I don't want to have to go back to all of the cohorts after the fact because of a minor error in adjustments.
Sounds reasonable!
For Ella to verify. I have run three simple models in COMETS-analytics using different adjustments (see link below), and need someone to verify that dummy coding is being handled properly. I have checked the models in SAS using my own hand-coding of dummy variables. The resulting correlations and p-values are very close, but not perfectly identical. Unfortunately, this could just be a SAS vs. R issue.
Ella, would you be able to hand code the dummy variables called for in the adjustments and run the models in the spreadsheet? Kindly document the values in the attached spreadsheet, reupload, and if values match, close the issue.
Just confirming: you are using corr_test.zip?
I am using the comets-analytics-test.org website directly.
S
Sorry, I wanted to confirm which input file: the zip or the cometsInput_March_2018.xlsx?
cometsInput_March_2018.xlsx
Also related to issue 32: https://github.com/CBIIT/R-cometsAnalytics/issues/32
To my understanding, the numbers are now lining up. Is this correct Ella? If yes, could you close this issue? I don't want to close without knowing for sure.
If you look at the example we are all testing, focus on the bmi_grp = 2 stratum with outcome _1_2_dipalmitoylglycerol, exposure age, and adjustment for race_grp.
Here's the ppcor result, which is the same as in SAS:
Running all strata works: the correlation is exact, but the p-value is still a tiny bit off.
But if you run restricted to stratum = 2, we get missing results.
But for bmi_grp = 3 it is good.
For the expanded p-value:
OK, I'm using the psych::corr.p function because it's part of the psych package, from which we're calculating the partial correlation matrix... will change to pcor.
Fingers crossed
Yes, use that package; see my markdown file, the code is there.
I've replaced it with pcor. Please note, though, that the code is now going to run very slowly, because this function is just as expensive as cor.test (which is why we ended up implementing it "by hand"). I still think it's worth going back to the previous code using the psych package and figuring out how to calculate the p-value appropriately.
pcor does solve all our issues, though (including the NA issue when only looking at bmi_grp stratum 2).
Results are pushed to GitHub.
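For the record, the p-value that a faster "by hand" psych-based approach would need is the standard t-based one for a partial correlation with k conditioning variables; my understanding is that this is also the statistic ppcor uses. A sketch:

```r
# Sketch: t-based p-value for a partial correlation r, from n observations
# and k conditioning variables. This formula is what would let the fast
# matrix-based approach keep p-values consistent with pcor.
partial_pvalue <- function(r, n, k) {
  df <- n - 2 - k
  t  <- r * sqrt(df / (1 - r^2))
  2 * pt(-abs(t), df)
}

# The adjusted correlation from the transcript earlier in this thread:
partial_pvalue(-0.01736411, n = 1000, k = 3)  # ~0.584, close to SAS's 0.5839
```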
One more thing: pcor fails if two adjustment variables have a correlation of 1 (which makes sense). I'd like to include a check for adjustment variables and remove any variable that has a correlation of 1 with another variable. It would throw a warning. I just want to check and make sure that is statistically sound. This can happen when you stratify... Specifically, in the example data of the package, it happens for model 2.2 BMI stratified, stratification group 2. Let me know what you think.
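For what it's worth, the proposed check could look roughly like the following base-R sketch (`drop_aliased` is a hypothetical name, not a function in the package):

```r
# Hypothetical sketch of the proposed check: before calling pcor(), drop any
# adjustment column whose absolute correlation with an earlier kept column
# is 1, and warn about it.
drop_aliased <- function(adj) {
  cc   <- abs(cor(adj))
  drop <- logical(ncol(adj))
  for (j in seq_len(ncol(adj))[-1]) {
    earlier <- seq_len(j - 1)
    if (any(cc[j, earlier][!drop[earlier]] >= 1 - 1e-12)) {
      drop[j] <- TRUE
      warning(colnames(adj)[j],
              " has correlation 1 with another adjustment; dropping it")
    }
  }
  adj[, !drop, drop = FALSE]
}

# Two complementary dummies (b = 1 - a) are perfectly anti-correlated:
adj <- cbind(a = c(0, 1, 0, 1), b = c(1, 0, 1, 0), c = c(0, 0, 1, 1))
colnames(drop_aliased(adj))  # "a" "c", with a warning about b
```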
We run the metabolites against each other, so I'm not sure how this would work if we remove variables.
When you say fail, do you mean crash?
S
In batch mode, some of the models are not working. You can verify this by running Age 2.2, for example, which returns this message:

```
system is computationally singular: reciprocal condition number = 2.27816e-36
```
Also, see below:
Alternatively, in Super Batch mode, you get the messages below. The last two messages, I assume, are related to singularity. Also, the BMI-strata warning is wrong here--there is more than one level of BMI.
Age.2.2 BMI stratified - Complete Warnings:
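The "computationally singular" message most likely comes from inverting a (near-)singular matrix: `solve()` inside the partial-correlation code fails when dummy columns are collinear. A minimal base-R reproduction of this class of error (not the actual COMETS call):

```r
# Minimal reproduction of this error class: solve() on a singular
# cross-product matrix, which is exactly what collinear dummy columns
# produce in a correlation/covariance matrix.
m   <- cbind(a = c(1, 0, 1), b = c(0, 1, 0), c = c(1, 1, 1))  # c = a + b
res <- try(solve(crossprod(m)), silent = TRUE)
inherits(res, "try-error")  # TRUE: the system is singular
```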
We are close. All of the models run except for models stratified by horm_curr. This particular bug occurs, I think, because some of the race_grp values only have one horm_curr value, causing the horm_curr dummy variables to be dropped. Here is a simple version of the model:
EXPOSURE: Age
OUTCOME: All metabolites
ADJUSTED COVARIATES: race_grp
STRATA: horm_curr
This returns the following error message:

```
supply both 'x' and 'y' or a matrix-like 'x'
```
Thinking this through has made me realize that a race_grp variable may have only one horm_curr value, but each horm_curr dummy variable could still have multiple race_grp values. We've been assuming an equivalence in the variables--that we can just pick one at random to drop--but this may not actually be the case.
Also, should the variable that is being stratified upon be part of the list of dummy variables being evaluated for correlation?
If we can answer these two questions and implement correctly, this issue will finally be solved.
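The stratification problem described above can be seen on toy data (names are illustrative, not our dataset): a dummy that varies overall can be constant within one stratum, and which dummy is redundant differs by stratum, so the drop decision has to be made per stratum rather than once globally:

```r
# Toy-data sketch of the per-stratum problem: which race_grp dummies are
# usable depends on the horm_curr stratum, so dummies must be re-evaluated
# (and possibly dropped) inside each stratum, not once globally.
d <- data.frame(
  horm_curr = c(0, 0, 0, 1, 1, 1),
  race_grp  = factor(c(1, 2, 2, 1, 1, 3))
)

for (s in unique(d$horm_curr)) {
  sub     <- droplevels(d[d$horm_curr == s, ])
  dummies <- model.matrix(~ race_grp, sub)[, -1, drop = FALSE]
  varying <- apply(dummies, 2, function(x) length(unique(x)) > 1)
  cat("stratum", s, "- usable dummies:",
      paste(colnames(dummies)[varying], collapse = ", "), "\n")
}
# stratum 0 keeps race_grp2; stratum 1 keeps race_grp3
```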
Got the same error with this model:
EXPOSURE: Age
OUTCOME: All metabolites
ADJUSTED COVARIATES: bmi
@steven-moore I tested via the command line your complicated example run:
EXPOSURE: Age
OUTCOME: All metabolites
ADJUSTED COVARIATES: race_grp
STRATA: horm_curr
and the algorithm works!!!
Glad the solution using caret is doing well. This should be well documented in the paper. caret has some nice functions for handling some of the issues we are dealing with: https://topepo.github.io/caret/pre-processing.html#dummy
From the coxph documentation:
> singular.ok: logical value indicating how to handle collinearity in the model matrix. If TRUE, the program will automatically skip over columns of the X matrix that are linear combinations of earlier columns. In this case the coefficients for such columns will be NA, and the variance matrix will contain zeros. For ancillary calculations, such as the linear predictor, the missing coefficients are treated as zeros.
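Both caret::findLinearCombos and coxph's singular.ok handling detect the same thing: columns of the design matrix that are linear combinations of earlier columns. To my understanding findLinearCombos is QR-based; the version below is a base-R illustration of that idea, not the package code:

```r
# Find columns that are linear combinations of earlier columns via the
# rank-revealing pivoted QR decomposition (the idea behind
# caret::findLinearCombos and coxph's singular.ok handling).
linear_combo_cols <- function(X) {
  qrX <- qr(X)
  if (qrX$rank == ncol(X)) return(integer(0))
  sort(qrX$pivot[(qrX$rank + 1):ncol(X)])
}

X <- cbind(a = c(1, 0, 1, 0), b = c(0, 1, 0, 1), c = c(1, 1, 1, 1))
linear_combo_cols(X)  # column 3, since c = a + b
```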
I will now test whether we are getting the right results using raw code.
Thank you, Ella, for the updates. We have one key issue remaining, but I am optimistic that the fix will be easy. When any covariate in the model has only one value (in the test dataset, this includes "female" and "fasting") and is combined with any other covariate, the model does not run and instead returns the error below:
I believe this occurs because the caret package is being called before covariates with only one value are excluded. So, for example, when caret tries to evaluate singularity, etc., for "female" relative to other covariates, it fails because "female" has only one value.
The solution should be simple: exclude these covariates before calling caret.
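The proposed pre-filter could be as simple as the base-R sketch below (`drop_constant` is a hypothetical name; the warning text mirrors the one the fix ended up emitting):

```r
# Hypothetical sketch of the fix: drop covariates with only one observed
# value before the caret collinearity checks ever see them.
drop_constant <- function(covars) {
  keep <- vapply(covars, function(x) length(unique(x[!is.na(x)])) > 1,
                 logical(1))
  for (nm in names(covars)[!keep])
    warning(nm, " has only one possible value; model will run without it")
  covars[, keep, drop = FALSE]
}

covars <- data.frame(age = c(55, 61, 70), female = c(1, 1, 1),
                     fasting = c(0, 0, 0))
names(drop_constant(covars))  # "age", with warnings about female and fasting
```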
The fix I made seems to make this go away; it now just produces a warning: "Warning: one of your models specifies fasted as an adjustment value but that variable only has one possible value. Model will run without fasted as an adjustment."
The <15 was called twice, so I took that out.
I tested this scenario with the updated code and it seems OK now, in this commit: https://github.com/CBIIT/R-cometsAnalytics/commit/c35674a744ef6025c5f6d6f9c01a749bbc367b37
I have completed testing and everything is working great. The app is correctly applying dynamic changes to the dummy variables used for adjustment. This includes stratified analyses.
So, for example, in analyses of age-metabolite associations among those with heart disease, the model is excluding educ_grp4, alc_grp3, multivitamin2, and horm_curr1 as dummy variables. In this small subset of participants (N = 133), each of these variables is a linear combination of other covariates and is highly (or perfectly) collinear, and therefore should be excluded from adjustments.
To my understanding, the threshold for correlation used to make this determination is 0.95--which seems like a reasonable choice. Ella will provide further documentation of this.
I have also verified all of these calculations against SAS. Using the same sets of dummy variables (which are now shown in the output spreadsheet, in the adjspec column), I found 100% correspondence in r values and p-values between SAS and COMETS-analytics.
Ewy and Ella: amazing work!
So glad that all is working! And yes, confirming that the threshold for correlation is set to 0.95. It is hard-coded for now but we could make this a parameter that users can input if we want. Ewy
Currently, categorical variables are not properly adjusted for--they are entered into the model as continuous variables. Models should distinguish between categorical and continuous variables using a new column that will be added to the Varmap tab.
This change will also require a change to the Sample file (to be logged separately) and to the "Create Input" utility (it needs to add this column to the input file that it creates).