gaow / neuro-twas

Development code (private repo) for TWAS related multiomic gene-mapping in Alzheimer's disease data

Potential issues in mv-susie cross-validation #36

Closed hsun3163 closed 3 years ago

hsun3163 commented 3 years ago

The primary difference between the CV in mv-susie and FUSION is that in FUSION a single k-fold CV is performed (the model is fitted k times, once per held-out fold), while in mv-susie, following the instructions here: https://p8105.com/cross_validation.html, the CV was repeated 100 times.

In theory, with more iterations of CV, the latter should produce a more accurate estimate of the true performance, and the CV result produced this way should be more reliable.
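As a rough illustration, a minimal base-R sketch of such a repeated train/test CV (fit_fn and predict_fn are hypothetical stand-ins for the actual mv-susie fitting and prediction code, not the workflow itself):

```r
# Hypothetical sketch of repeated train/test CV: in each of n_iter iterations,
# hold out a random test set, fit on the rest, and record the out-of-sample r2.
repeated_cv_r2 <- function(X, y, fit_fn, predict_fn, n_iter = 100, test_frac = 0.2) {
  sapply(seq_len(n_iter), function(i) {
    test_idx <- sample(nrow(X), size = round(test_frac * nrow(X)))
    fit  <- fit_fn(X[-test_idx, , drop = FALSE], y[-test_idx])
    pred <- predict_fn(fit, X[test_idx, , drop = FALSE])
    summary(lm(y[test_idx] ~ pred))$r.squared   # per-iteration CV r2
  })
}
```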

However, in the CV of some genes, namely ENSG00000265148, a bimodal distribution of r2 is observed, which may be an indication of problems.

[Screenshot, 2021-02-22: bimodal distribution of CV r2 for ENSG00000265148]

It should be noted that this gene is excluded from fusion-twas by the hsq check.

Next courses of action would be:

  1. Redo the CV for this gene and some other genes the FUSION way (unlikely to show a bimodal pattern with only 5 iterations).
  2. Redo the CV of the FUSION weights for some other genes our way and compare the results (a bell-shaped distribution around the reported r2 is expected).

Another possibility is that the problem arose from the cleaning of MAF = 0 SNPs in X. However, removing this cleaning step leads to errors.

gaow commented 3 years ago

@hsun3163 I wonder if it is possible, in terms of engineering, to include mvSuSiE into the FUSION script so we can use the same CV procedure?

Once you have cleaned up your workflow you can ping me to review it, and I'll see if I can identify something problematic. In the meantime I'm wondering if it is worth posting something on Cross Validated to see if people have general experience with this behavior (not sure if there is enough info for them to weigh in, but it might not hurt to try).

hsun3163 commented 3 years ago

> @hsun3163 I wonder if it is possible, in terms of engineering, to include mvSuSiE into the FUSION script so we can use the same CV procedure?
>
> Once you have cleaned up your workflow you can ping me to review it, and I'll see if I can identify something problematic. In the meantime I'm wondering if it is worth posting something on Cross Validated to see if people have general experience with this behavior (not sure if there is enough info for them to weigh in, but it might not hurt to try).

The FUSION script is dedicated to univariate analysis, and I would rather not make the already bulky wrapper even bulkier by putting the mvSuSiE analysis into it.

I have rethought the currently implemented cross-validation process and confirmed that the problem occurs at the post-CV data cleaning step, namely:

> Another possibility is that the problem arose from the cleaning of MAF = 0 SNPs in X after splitting. However, removing this cleaning step leads to errors.

It turns out that doing so results in different sets of SNPs being included in different CV runs, which invalidates the CV process. For example, only one SNP is shared between the runs that produced the high r2 and the low r2 in the image above. The SNPs that best predict the expression are excluded in the low-r2 runs.

Currently I have removed the post-CV X cleaning step and the subsequent X scaling step. The post-CV scale(X), which leads to errors, is not included in the FUSION script; I had included it in the hope of gaining more internal validity.
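For illustration, a minimal sketch of one way to keep the SNP set fixed across runs (filter once on the full X before splitting); the function name and the 0/1/2 genotype coding are assumptions, not the actual workflow code:

```r
# Hypothetical sketch: drop MAF = 0 (monomorphic) SNPs once, before any
# train/test split, so the set of SNPs is identical across CV runs.
drop_monomorphic <- function(X) {
  maf <- colMeans(X, na.rm = TRUE) / 2            # assumes 0/1/2 genotype coding
  X[, pmin(maf, 1 - maf) > 0, drop = FALSE]       # keep only SNPs with MAF > 0
}

X_clean <- drop_monomorphic(X)  # filter once; then split X_clean in each CV iteration
```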

If that is a success, I would actually prefer doing the CV our way instead of the FUSION way, and am thinking about changing the CV part of their code into ours, as doing only 5 iterations of CV (FUSION) is a little unreliable.

gaow commented 3 years ago

> The FUSION script is dedicated to univariate analysis, and I would rather not make the already bulky wrapper even bulkier by putting the mvSuSiE analysis into it.

I was actually suggesting the other way around -- if you could modularize that function in FUSION so it can be shared with mvSuSiE outside the FUSION code, that'd be ideal.

> It turns out that doing so results in different sets of SNPs being included in different CV runs, which invalidates the CV process.

Great. The suggestion to use the same code is to rule out possible errors due to the CV procedure itself, so we can focus on the upstream diagnosis. But it is good you found the possible cause!

> I would actually prefer doing the CV our way instead of the FUSION way, and am thinking about changing the CV part of their code into ours, as doing only 5 iterations of CV (FUSION) is a little unreliable.

Sure!

One simple way to debug is to run the modeling again without cross-validation, just fit the entire data and check the distribution of R^2 (no train/test split at all). This way we can have an estimate of the performance, which should be better than but not too far away from cross-validation. But I guess you've identified the bug already anyway.
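As a rough sketch of that check (fit_fn and predict_fn are hypothetical stand-ins for the actual fitting and prediction code):

```r
# Hypothetical sketch: fit on the full data with no train/test split and record
# the in-sample R^2 for a gene, to compare against its CV R^2.
insample_r2 <- function(X, y, fit_fn, predict_fn) {
  fit  <- fit_fn(X, y)
  pred <- predict_fn(fit, X)
  summary(lm(y ~ pred))$r.squared
}
```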

hsun3163 commented 3 years ago

> One simple way to debug is to run the modeling again without cross-validation, just fit the entire data and check the distribution of R^2 (no train/test split at all). This way we can have an estimate of the performance, which should be better than but not too far away from cross-validation. But I guess you've identified the bug already anyway.

I did something similar, but instead of doing it for all the genes and getting the distribution, I picked some genes with the worst performance and looked at them individually. Luckily enough, the first one I found is the bimodal one shown above.

gaow commented 3 years ago

> Luckily enough, the first one I found is the bimodal one shown above.

I'm confused ... if you run on all the data for one gene, you will just have one R^2 and not a distribution, right? Are you saying the R^2 on the entire data set was computed for a few genes and already shows this?

I think it makes sense to verify this only on the best performing genes -- the ones you picked as "signals".

hsun3163 commented 3 years ago

> Luckily enough, the first one I found is the bimodal one shown above.
>
> I'm confused ... if you run on all the data for one gene, you will just have one R^2 and not a distribution, right? Are you saying the R^2 on the entire data set was computed for a few genes and already shows this?
>
> I think it makes sense to verify this only on the best performing genes -- the ones you picked as "signals".

I chose some genes, including the best performing one and some others with the worst performance, checked whether each of their CV r2 values makes sense, and also compared each to its non-CV r2. The best performing one actually makes sense, but it is the worst performing ones that show the anomaly.

gaow commented 3 years ago

Good. It's okay if we focus on the ones that pass the cutoff. I was under the impression that the r^2 for those were bad, as shown in your notebook in our Monday morning meeting. But maybe they are fixed already.

hsun3163 commented 3 years ago

Removal of the X cleaning step did not magically solve the bimodal issue shown above. In cases where the CV produces estimated weights with a constant intercept and zeros for all other coefficients, this artificially produces a high r2. The current remedy is to skip the runs with constant fitted values, compute the mean r2 and pval for the remaining runs, and document the indices of the skipped runs.
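A minimal sketch of that remedy (per_run_pred and per_run_obs are hypothetical names: lists of predicted and observed expression vectors, one element per CV run; not the actual workflow code):

```r
# Hypothetical sketch: drop CV runs whose fitted values are constant (intercept-only
# weights), then summarise r2 / p-value over the remaining runs.
summarise_cv <- function(per_run_pred, per_run_obs) {
  constant <- vapply(per_run_pred, function(p) sd(p) == 0, logical(1))
  keep <- which(!constant)
  fits <- lapply(keep, function(i) summary(lm(per_run_obs[[i]] ~ per_run_pred[[i]])))
  list(
    mean_r2   = mean(vapply(fits, function(s) s$r.squared, numeric(1))),
    mean_pval = mean(vapply(fits, function(s) coef(s)[2, 4], numeric(1))),
    skipped   = which(constant)    # document which runs were dropped
  )
}
```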

gaow commented 3 years ago

> In cases where the CV produces estimated weights with a constant intercept and zeros for all other coefficients, this artificially produces a high r2

Hmm, do you know why this happens? Sounds like some random partitions just don't have any signal. Does it happen for both the significant and non-significant TWAS genes? Are we currently focusing only on significant genes?

hsun3163 commented 3 years ago

> In cases where the CV produces estimated weights with a constant intercept and zeros for all other coefficients, this artificially produces a high r2
>
> Hmm, do you know why this happens? Sounds like some random partitions just don't have any signal.

Yes, I think that is exactly the reason, so it would make sense to remove them from the CV iterations.

> Does it happen for both the significant and non-significant TWAS genes? Are we currently focusing only on significant genes?

That is yet to be determined as the runs are still going; I will update once the results are available.

hsun3163 commented 3 years ago

As it turns out, the artificially high r2 co-occurs with a NULL in r2_raw, where the r2 is computed manually. With the added mechanism, these r2 values are skipped.

The problem of constant X columns is also avoided by skipping the runs with a constant X column.

These patches will remove some of the iterations, but should not bias the estimated r2.
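A minimal sketch of these skip checks (hypothetical names, not the actual workflow code):

```r
# Hypothetical sketch: flag a CV run for skipping if the training genotype matrix
# contains a constant (zero-variance) column, or if the manually computed r2_raw
# came back NULL/NA.
skip_run <- function(X_train, r2_raw) {
  has_constant_col <- any(apply(X_train, 2, sd) == 0)
  is.null(r2_raw) || anyNA(r2_raw) || has_constant_col
}
```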


With the implemented mechanism, the r2 for mv_susie genes should now be comparable with that of the FUSION weights.

Among the genes that went into TWAS/FUSION association testing:

Following are the comparisons of the mean r2 and pval between FUSION and mv_susie:

Proj_Name | r2 | pval | susie_r2 | susie_pval
-- | -- | -- | -- | --
Alz_AC_SNP | 0.1447902 | 0.001055535 | 0.1315790 | 0.06206250
Alz_DLPFC_SNP | 0.1271512 | 0.00111693 | 0.1217056 | 0.04960000
Alz_PCC_SNP | 0.1307633 | 0.01012342 | 0.1263915 | 0.08792857

Following are the comparisons of the median r2 and pval between FUSION and mv_susie:


Proj_Name | r2 | pval | susie_r2 | susie_pval
-- | -- | -- | -- | --
Alz_AC_SNP | 0.091 | 1.3e-15 | 0.08519466 | 0.0095
Alz_DLPFC_SNP | 0.063 | 5.7e-14 | 0.06123020 | 0.0120
Alz_PCC_SNP | 0.076 | 7e-11 | 0.07834360 | 0.0255

As it turns out, the r2 is comparable between the two methods, but the pval is not. A more detailed examination should reveal where the inflation of the p-value occurs.


Looking at it from another perspective, the numbers of genes for which mv_susie has a smaller p.value, a larger r2, or both (relative to FUSION) are shown respectively:

Proj_Name | cnt_pv_win | cnt_r2_win | cnt_both_win
-- | -- | -- | --
Alz_AC_SNP | 13 | 19 | 3
Alz_DLPFC_SNP | 10 | 21 | 3
Alz_PCC_SNP | 7 | 24 | 2

Following is the proportion version of the table above:

Proj_Name | pcnt_pv_win | pcnt_r2_win | pcnt_both_win
-- | -- | -- | --
Alz_AC_SNP | 0.2708333 | 0.3958333 | 0.06250000
Alz_DLPFC_SNP | 0.2500000 | 0.5250000 | 0.07500000
Alz_PCC_SNP | 0.1666667 | 0.5714286 | 0.04761905

As the tables indicate, for the DLPFC and PCC tissues the majority of genes have an mv_susie r2 and p.value larger than those of FUSION.

One of the remaining things that could be done is to CV the FUSION method our way, but I doubt it would make a lot of difference, as demonstrated by the following figures:

[Screenshot 1, 2021-03-05]

[Screenshot 2, 2021-03-05]

hsun3163 commented 3 years ago

A potential reason for the low p-values of the FUSION CV is that they compute the predicted expression for each segment (fold), rbind the segments together, and compare the result to the actual expression as a whole in one linear regression, obtaining a single r2 and p-value.

This practice artificially deflates the p-value. Assuming the r2 between the predicted and the real data is the same for each part, pooling the predicted values from 5 folds increases the sample size used in calculating the CV p-value by a factor of five, which in turn inflates the power. The effect of doing so can be illustrated by the simple example below:

> library(tidyverse)
> b = tibble(
+   x = c(1:5),
+   y  = c(5,2,6,3,2))
> lm(x~y,data = b)%>%summary()

Call:
lm(formula = x ~ y, data = b)

Residuals:
      1       2       3       4       5 
-1.4697 -1.6061  0.9091  0.7727  1.3939 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   4.3636     1.7870   2.442   0.0923 .
y            -0.3788     0.4524  -0.837   0.4639  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.644 on 3 degrees of freedom
Multiple R-squared:  0.1894,    Adjusted R-squared:  -0.08081 
F-statistic: 0.7009 on 1 and 3 DF,  p-value: 0.4639
> a = tibble(
+   x = c(1:5,1:5,1:5,1:5,1:5,1:5),
+   y  = c(5,2,6,3,2,5,2,6,3,2,5,2,6,3,2,5,2,6,3,2,5,2,6,3,2,5,2,6,3,2)
+ )
> lm(x~y,data = a)%>%summary()

Call:
lm(formula = x ~ y, data = a)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6061 -1.4697  0.7727  0.9091  1.3939 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.3636     0.5849   7.460 3.99e-08 ***
y            -0.3788     0.1481  -2.558   0.0162 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.318 on 28 degrees of freedom
Multiple R-squared:  0.1894,    Adjusted R-squared:  0.1604 
F-statistic: 6.542 on 1 and 28 DF,  p-value: 0.01624

The r2 remains the same, yet the p-values decrease drastically.