CBIIT / R-cometsAnalytics

R package development for COMETS Analytics

COMETS 1.3. Handle metabolites where variance=0 #24

Closed steven-moore closed 6 years ago

steven-moore commented 6 years ago

It will sometimes happen that a metabolite has no variance, i.e. has the same value for every single participant. When this occurs, there should be no analysis/results for this metabolite, but analysis/results for other metabolites should carry forward as normal. Currently, however, the analysis crashes when it runs into any metabolite with variance=0.

We need a better method for handling metabolites where variance=0.
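As a minimal illustration of the underlying behaviour (toy data, not the COMETS sample file), base R's `cor()` does not stop on a constant vector; it warns and returns `NA`. The variable names below are made up for this sketch.

```r
# Toy data: one constant "metabolite" and one that varies.
age      <- c(34, 41, 29, 57, 63, 48)
constant <- rep(1, length(age))               # same value for every participant
varying  <- c(2.1, 2.9, 1.8, 4.2, 4.9, 3.5)

# cor() warns that the standard deviation is zero and returns NA
# for the constant vector; it does not raise an error.
r_constant <- suppressWarnings(cor(age, constant))
r_varying  <- cor(age, varying)

print(var(constant))   # variance of the constant vector
print(r_constant)      # NA
print(r_varying)       # a finite correlation
```

So a crash in the full pipeline would have to come from how the `NA` is handled downstream, not from the correlation call itself.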

steven-moore commented 6 years ago

Prior e-mail:

Hi Ewy and Ella,

cometsInput_adjustment_test.xlsx

I’ve got this pinned down now. Use the file attached for a quick walkthrough. This file is identical to the sample download file in every way except that for the first metabolite, _1_2_3_benzenetriol_sulfate_2, I have overwritten every single value with “1”. This is, unfortunately, a fairly frequent scenario with Metabolon data. Basically, a metabolite can appear in the 10% QC samples, but not have valid values in the participants of interest. This leads to every single observation having the same value in the dataset being analyzed.

Steps to reproduce the bug and to demonstrate that it is specific to metabolites with no variance:

  1. Go to comets-analytics.org and input the dataset.
  2. Select age as exposure and all metabolites as outcome. Click “run model”. Note that results appear and values are correct, except that for _1_2_3_benzenetriol_sulfate_2 there is now a NA for its correlation value. So far, so good.
  3. In the “pvalue” column, enter “.00001” in the max box.
  4. “Check all” and click the tag button. Enter “age-related” as the tag name.
  5. Select bmi_grp as exposure and all metabolites as outcome. Click “run model”. Note that this model also runs fine.
  6. Run age as exposure, all metabolites as outcome, and bmi_grp as an “Adjusted covariate”. Click “run model”. Note that this returns NA for every single metabolite. This is obviously an error.
  7. Run age as exposure, age-related as outcome, and bmi_grp as “Adjusted covariate”. Click “run model”. Note that this runs fine.

To me, it seems like there is a flag that is being triggered when it runs into a metabolite with no variance, but then that flag is not being turned off or is being misapplied to all metabolites instead of specific ones.

This has been an issue for two of six cohorts so far, and will likely repeat itself in many others. So this is a highest-priority fix.

S

ellatemprosa commented 6 years ago

good research! we can screen zero variance in 2 stages, at integrity check and specific models

steven-moore commented 6 years ago

Yes, we'll need to check within each stratum as well, unless there is a more generic alternative that changes how R deals with this.
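A per-stratum check could look roughly like the sketch below. The function name `zero_variance_by_stratum` and the toy columns are hypothetical and are not part of the COMETS package; the point is only that a metabolite can vary overall yet be constant within one stratum.

```r
# Return, for each stratum, the metabolite columns whose variance is 0
# (or cannot be computed because fewer than two non-missing values exist).
zero_variance_by_stratum <- function(data, metab_cols, stratum_col) {
  groups <- split(data[, metab_cols, drop = FALSE], data[[stratum_col]])
  lapply(groups, function(g) {
    v <- vapply(g, function(x) var(x, na.rm = TRUE), numeric(1))
    names(v)[is.na(v) | v == 0]
  })
}

# Toy data: metab_a varies overall but is constant within the "low" stratum.
toy <- data.frame(
  bmi_grp = c("low", "low", "low", "high", "high", "high"),
  metab_a = c(1, 1, 1, 2.0, 3.5, 1.2),
  metab_b = c(0.4, 0.9, 1.3, 2.2, 0.7, 1.8)
)

flagged <- zero_variance_by_stratum(toy, c("metab_a", "metab_b"), "bmi_grp")
print(flagged)
```

A check on the full dataset alone would miss `metab_a` here, which is why the stratified case needs its own pass.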

ewymathe commented 6 years ago

This seems to work as expected in the R package. Using the example from Steve above and running model as outlined in "step 6", I get the following:

    > modeldata<-getModelData(exmetabdata,colvars="age",modelspec="Interactive",adjvars="bmi_grp")
    [1] "Analysis will run on 'All metabolites'"
    > head(calcCorr(modeldata,exmetabdata))
    [1] "running adjusted"
      cohort        spec model                   outcomespec exposurespec
    1        Interactive       _1_2_3_benzenetriol_sulfate_2          age
    2        Interactive            _1_2_dipalmitoylglycerol          age
    3        Interactive                    _1_2_propanediol          age
    4        Interactive               _1_3_7_trimethylurate          age
    5        Interactive                  _1_3_dimethylurate          age
    6        Interactive            _1_3_dipalmitoylglycerol          age
               corr    n    pvalue adjvars                   outcome_uid
    1           NaN 1000       NaN bmi_grp _1_2_3_benzenetriol_sulfate_2
    2 -0.0006160116 1000 0.9844854 bmi_grp      _1_2_dipalmitoylglycerol
    3 -0.0205791015 1000 0.5158877 bmi_grp              _1_2_propanediol
    4 -0.0042940233 1000 0.8921757 bmi_grp         _1_3_7_trimethylurate
    5 -0.0331511567 1000 0.2951996 bmi_grp            _1_3_dimethylurate
    6 -0.0343873521 1000 0.2775497 bmi_grp      _1_3_dipalmitoylglycerol
                            outcome exposure_uid     exposure
    1 _1_2_3_benzenetriol_sulfate_2          age Age at Entry
    2      _1_2_dipalmitoylglycerol          age Age at Entry
    3              _1_2_propanediol          age Age at Entry
    4         _1_3_7_trimethylurate          age Age at Entry
    5            _1_3_dimethylurate          age Age at Entry
    6      _1_3_dipalmitoylglycerol          age Age at Entry
    Warning message:
    In cov2cor(X.resid) :
      diag(.) had 0 or NA entries; non-finite result is doubtful

This may be something at the GUI level.

steven-moore commented 6 years ago

Based on the warning message, it seems like metabolites with 0 variance cause an automatic error in R, causing it to exit all further analyses. Am I correct, Ewy?

If this is the case, then each analysis (especially stratified analyses) will need to check the list of metabolites for 0 variance issues (i.e. the fix cannot be done by changing model output).

ewymathe commented 6 years ago

No, the metabolites with 0 variance cause a warning, and the analysis keeps going. The output therefore includes all the metabolites, with NaN for the metabolites with 0 variance (sorry, hadn't pasted the full output earlier):

    > modeldata<-getModelData(exmetabdata,colvars="age",modelspec="Interactive",adjvars="bmi_grp")
    [1] "Analysis will run on 'All metabolites'"
    > head(calcCorr(modeldata,exmetabdata, "DPP"))
    [1] "running adjusted"
      cohort        spec model                   outcomespec exposurespec
    1    DPP Interactive       _1_2_3_benzenetriol_sulfate_2          age
    2    DPP Interactive            _1_2_dipalmitoylglycerol          age
    3    DPP Interactive                    _1_2_propanediol          age
    4    DPP Interactive               _1_3_7_trimethylurate          age
    5    DPP Interactive                  _1_3_dimethylurate          age
    6    DPP Interactive            _1_3_dipalmitoylglycerol          age
               corr    n    pvalue adjvars                   outcome_uid
    1           NaN 1000       NaN bmi_grp _1_2_3_benzenetriol_sulfate_2
    2 -0.0006160116 1000 0.9844854 bmi_grp      _1_2_dipalmitoylglycerol
    3 -0.0205791015 1000 0.5158877 bmi_grp              _1_2_propanediol
    4 -0.0042940233 1000 0.8921757 bmi_grp         _1_3_7_trimethylurate
    5 -0.0331511567 1000 0.2951996 bmi_grp            _1_3_dimethylurate
    6 -0.0343873521 1000 0.2775497 bmi_grp      _1_3_dipalmitoylglycerol
                            outcome exposure_uid     exposure
    1 _1_2_3_benzenetriol_sulfate_2          age Age at Entry
    2      _1_2_dipalmitoylglycerol          age Age at Entry
    3              _1_2_propanediol          age Age at Entry
    4         _1_3_7_trimethylurate          age Age at Entry
    5            _1_3_dimethylurate          age Age at Entry
    6      _1_3_dipalmitoylglycerol          age Age at Entry
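The warn-and-continue behaviour can be reproduced with base R alone. The sketch below uses toy vectors rather than the package's residual matrix, so it is an assumption that `calcCorr` reaches `cov2cor()` in a comparable way; it only shows that `cov2cor()` on a covariance matrix with a zero on the diagonal emits a warning, returns `NaN` for the affected entries, and leaves the other correlations intact.

```r
set.seed(1)
x     <- rnorm(20)
y     <- rnorm(20)
const <- rep(1, 20)                 # zero-variance column

V <- cov(cbind(x, y, const))        # diag(V)["const"] is exactly 0

# Record whether a warning was raised, then let the call finish normally.
warned <- FALSE
R <- withCallingHandlers(
  cov2cor(V),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

print(warned)   # whether cov2cor() warned
print(R)        # NaN in the off-diagonal "const" entries, finite elsewhere
```

The exact wording of the warning differs between R versions, so the sketch checks only that one was raised.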

steven-moore commented 6 years ago

Try this in combination with statistical adjustment. I think that's where the real problems come in.

ewymathe commented 6 years ago

The model above is already adjusted for "bmi_grp", run with all metabolites as outcome and age as exposure.

steven-moore commented 6 years ago

OK, so we could handle on either the input or output end. Let's discuss on Friday.

steven-moore commented 6 years ago

For that discussion on Friday:

On the input end, it may be logical for us to check, in each subgroup examined, whether there is sufficient N above the "limit of detection" (calculated the same way we did in the "integrity check"). For example, we could require at least 15 values above the "limit of detection" for each metabolite in each analysis. This would entail creating a backend table that tracks these values, then dropping metabolites from our array if they don't meet our criteria.
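A rough sketch of that input-side check follows. How the integrity check derives the limit of detection is not shown in this thread, so here the limits are simply passed in as a named vector; the function name `metabolites_with_enough_n`, the `lod` argument, and the toy data are all hypothetical.

```r
# Count, per metabolite, the values strictly above that metabolite's limit of
# detection, and keep only metabolites with at least `min_n` such values.
metabolites_with_enough_n <- function(data, metab_cols, lod, min_n = 15) {
  n_above <- vapply(metab_cols, function(m) {
    sum(data[[m]] > lod[[m]], na.rm = TRUE)
  }, integer(1))
  list(n_above = n_above, keep = names(n_above)[n_above >= min_n])
}

# Toy subgroup of 20 participants: metab_a sits at its limit for everyone.
subgroup <- data.frame(
  metab_a = rep(1, 20),
  metab_b = as.numeric(1:20)
)
lod <- c(metab_a = 1, metab_b = 3)   # assumed limits, for illustration only

check <- metabolites_with_enough_n(subgroup, c("metab_a", "metab_b"), lod)
print(check$n_above)   # the "backend table" of counts
print(check$keep)      # metabolites that pass the threshold
```

Running it once per subgroup (for example on each stratum's rows) would give the per-analysis table described above.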

steven-moore commented 6 years ago

Last we left off, Wesley was planning on checking whether a different R environment would resolve this issue. Wesley, has this testing been done?

ewymathe commented 6 years ago

Good point, Steve. We could certainly implement a filterMetabolite function that would remove metabolites that have low variance or many missing values. I do this routinely for all my analyses. However, I'm not sure we should do this on a model-by-model basis, since filtering should be independent of outcome. Ewy
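One possible shape for such a filter is sketched below. It is not the package's implementation: the name `filter_metabolites_sketch` and the default thresholds (`min_var = 0`, `max_missing = 0.5`) are placeholders chosen for illustration.

```r
# Drop metabolites whose variance is at or below `min_var`, or whose fraction
# of missing values exceeds `max_missing`. Applied once to the whole dataset,
# independent of any particular model.
filter_metabolites_sketch <- function(data, metab_cols,
                                      min_var = 0, max_missing = 0.5) {
  keep <- vapply(metab_cols, function(m) {
    x <- data[[m]]
    frac_missing <- mean(is.na(x))
    v <- var(x, na.rm = TRUE)
    frac_missing <= max_missing && !is.na(v) && v > min_var
  }, logical(1))
  list(kept = metab_cols[keep], dropped = metab_cols[!keep])
}

toy <- data.frame(
  metab_const  = rep(1, 6),                        # zero variance
  metab_sparse = c(NA, NA, NA, NA, 0.3, 0.8),      # 4 of 6 values missing
  metab_ok     = c(0.4, 0.9, 1.3, 2.2, 0.7, 1.8)
)

res <- filter_metabolites_sketch(toy, names(toy))
print(res$kept)
print(res$dropped)
```

Because it runs on the full dataset, a filter like this would not catch a metabolite that is constant only within one stratum.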


steven-moore commented 6 years ago

In the new environment, the NA values are handled appropriately, and they do not crash in models with additional adjustments, etc. This was handled well across all models in the zip file. This issue is now closed.