Schork-Lab / cg-classifier

Complete Genomics Classifier

CGA_XR dbSNP ID - Large Logistic Regression Coefficient #4

Open erscott opened 10 years ago

erscott commented 10 years ago

Here are the coefficients for a logistic regression model trained on NA12877 data:

```python
[('CGA_SDO', -0.17694532513145272), ('GQ', 0.002224374491740649), ('DP', -0.13170585952474742),
 ('CGA_RDP', 0.054161178747585317), ('FT_PASS', 0.22249756454496386), ('FT_VQLOW', -0.42651704549676267),
 ('vartype1_del', 0.23825101496518389), ('vartype1_ins', 0.10170873833289407), ('vartype1_mnp', -0.13412761694268152),
 ('vartype1_snp', -0.40985161730723996), ('vartype2_del', -0.15075038149975903), ('vartype2_ins', -0.032873575059131931),
 ('vartype2_mnp', -0.029752329053545842), ('vartype2snp', 0.0093568046606915412), ('phase/', 0.5697832406315666),
 ('phase_|', -0.77861093954537608), ('zygosity_het-alt', -0.19980565605268524), ('zygosity_het-ref', -0.064063491513729476),
 ('zygosity_hom-alt', 0.05984966661455153), ('AA_GL', -0.0053635138209772162), ('AB_GL', 0.0040508143273663318),
 ('BB_GL', 0.0010023907008430484), ('AC_GL', -0.0028085290821979201), ('BC_GL', -0.038495879466188647),
 ('CC_GL', -0.0028085290821979201), ('AA_CGA_CEGL', -0.040157501206573902), ('AB_CGA_CEGL', -0.0072489272945790171),
 ('BB_CGA_CEGL', -0.068477505439082417), ('AC_CGA_CEGL', 0.042682227254171659), ('BC_CGA_CEGL', -0.038495879466188647),
 ('CC_CGA_CEGL', 0.042682227254171659), ('HQ_1', -0.0041590245092664943), ('HQ_2', -0.011446250751156526),
 ('EHQ_1', 0.0011518542020954358), ('EHQ_2', 0.0088462734372300317), ('CGA_CEHQ_1', 0.016010253748059811),
 ('CGA_CEHQ_2', 0.0023009523920917451), ('AD_1', 0.054391496800532214), ('AD_2', 0.066038449916474395),
 ('CGA_XR', 1.4966237727139036), ('CGA_RPT', -0.50355826238590506), ('multiallele', -0.20405801536568471)]
```

As you can see, CGA_XR has by far the largest coefficient. This worries me a bit because I think it will severely penalize novel variants, which are often the most interesting ones. Also, this training/testing data set has been extensively profiled in the past, yielding many dbSNP IDs. So the large CGA_XR coefficient and the improved model fit are not surprising with this training data, but I think this is likely over-fitting. I think we should exclude CGA_XR as a feature: most users of this method will want to maximize the yield from the rarer variants, and we can probably salvage the known variants through imputation.
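To make the concern concrete, here is a minimal sketch (plain Python, with an abbreviated subset of the coefficients above; the variable names are mine, not part of the codebase) of ranking the coefficients by magnitude and dropping CGA_XR from the feature set before refitting:

```python
# Abbreviated (name, coefficient) pairs from the fitted model above
coeffs = [('CGA_SDO', -0.177), ('FT_VQLOW', -0.427), ('phase_|', -0.779),
          ('CGA_XR', 1.497), ('CGA_RPT', -0.504), ('multiallele', -0.204)]

# Rank features by absolute coefficient value, largest first
ranked = sorted(coeffs, key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[0][0])  # CGA_XR dominates

# Exclude CGA_XR before retraining the model
features = [name for name, _ in coeffs if name != 'CGA_XR']
```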

I have trained two Random Forest models, with and without CGA_XR, and there are subtle differences. Here are the false positive (fp; 9,985 total variants in the fp test set) and true positive (tp; 99,945 total variants in the tp test set) counts at different probability cutoffs:

| Model | Prob. cutoff | fp | tp |
| --- | --- | --- | --- |
| Without CGA_XR | 0.28 | 1026 | 91178 |
| Without CGA_XR | 0.30 | 1092 | 91860 |
| With CGA_XR | 0.28 | 1011 | 91612 |
| With CGA_XR | 0.30 | 1072 | 92314 |

It looks like we gain about 400-500 true positives per ~100k true positives when using the CGA_XR feature. With ~3.6 million tp variants in a genome, we would therefore lose about ~16,000 tp variants per genome without it. We would also pick up roughly ~500 additional false positive variants per genome if we don't use the CGA_XR feature.
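The per-genome extrapolation above can be checked directly from the counts in the table; this is just the arithmetic restated, using the tp test-set size and the ~3.6M-per-genome figure from this comment:

```python
# Scale the tp gain seen in the test set up to a whole genome
tp_test_total = 99945             # tp variants in the test set

gain_028 = 91612 - 91178          # tp gained with CGA_XR at 0.28 cutoff
gain_030 = 92314 - 91860          # tp gained with CGA_XR at 0.30 cutoff

genome_tp = 3.6e6                 # ~3.6 million tp variants per genome
lost_per_genome = genome_tp / tp_test_total * (gain_028 + gain_030) / 2
print(round(lost_per_genome))     # ~16000
```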

I think we should compare the allele frequency distributions of the true positive variants that are disjoint between the results of the two models (with/without CGA_XR).
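A minimal sketch of that comparison, using set differences on variant IDs; the IDs and allele frequencies here are purely illustrative placeholders, not real results:

```python
# Hypothetical tp call sets from the two models
tp_with_xr    = {'var1', 'var2', 'var3', 'var5'}
tp_without_xr = {'var1', 'var2', 'var4'}

only_with    = tp_with_xr - tp_without_xr    # tp rescued by CGA_XR
only_without = tp_without_xr - tp_with_xr    # tp lost by adding CGA_XR

# Look up allele frequencies for the disjoint calls and compare
allele_freq = {'var3': 0.21, 'var4': 0.002, 'var5': 0.35}
af_only_with    = sorted(allele_freq[v] for v in only_with)
af_only_without = sorted(allele_freq[v] for v in only_without)
```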

kunalbhutani commented 10 years ago

We can apply the same idea as tranche levels and threshold on what % of dbSNPs we are able to call reliably. So don't use it as a feature in training, but do use it as a sort of latent variable to find cutoffs, instead of just looking at the ROC curve.
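One way to sketch this tranche-style idea: hold CGA_XR out of training, then pick the probability cutoff at which a target fraction of dbSNP variants is still called. The function name and target value below are hypothetical, just to illustrate the suggestion:

```python
def cutoff_for_dbsnp_sensitivity(probs_dbsnp, target=0.99):
    """Highest cutoff that still calls >= target fraction of the
    dbSNP variants, given their predicted probabilities."""
    ranked = sorted(probs_dbsnp, reverse=True)
    k = int(target * len(ranked))   # how many dbSNP calls we must retain
    return ranked[k - 1]            # lowest probability among retained calls

# Toy predicted probabilities for dbSNP variants (not real model output)
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
print(cutoff_for_dbsnp_sensitivity(probs, target=0.9))  # 0.15
```

Any variant (novel or known) scoring above that cutoff would then pass, so dbSNP membership shapes the threshold without penalizing novel variants directly.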

erickramer commented 10 years ago

I tossed this feature when I was training models, for this reason. Almost all of the TPs have dbSNP IDs.


E. Ransom Kramer, Torkamani Lab, The Scripps Research Institute