borangao / MESuSiE

12 stars 0 forks source link

test gene result from 3 population MESuSiE run #2

Closed arkyl closed 5 months ago

arkyl commented 5 months ago

Hi, Thanks much for the software and timely reply! We have run the MESuSiE on our QTL data from 3 populations: NHW(non-Hispanic White), AA(African American) and Hispanic, with sample size of NHW >1000 and AA/Hisp each <200. Here are some questions that we hope you could give some advice on:

We did some extensive testing on one of our genes that we know containing strong QTL signals. Our default setting is ("AA","Hisp","NHW"), L=10. We found:

1) the order of input pop groups affect the results. E.g our default ("AA","Hisp","NHW") gives 10 cs, while ("NHW","Hisp","AA") gives 8 cs.

2) test on effect of L on cs number: 10 cs when L=10; 17 cs when L=20, with L20 max; 34 cs when L50, with L50 max. So number of cs didn't saturate even when L=50. Is this ok?

3) compare original QTL vs MESuSiE configuration: The MESuSiE configuration seems to be different from what I would expect from original QTL:

e.g: L2 contains single snp 552 (rs637571), configured as AA_Hisp(pip_config:6.810229e-01). The Z of snp 552: NHW Z=11.64;Hisp Z=4.97; AA Z=0.757. I would have expected that AA_Hisp would not be strong as it contains AA which is not significant QTL. The all config for snp 552: AA Hisp NHW AA_Hisp AA_NHW 1.142324e-113 6.823896e-02 1.848905e-96 6.810229e-01 2.554795e-98 Hisp_NHW AA_Hisp_NHW 2.283364e-02 2.279045e-01

e.g: L5 contains single snp 509 (rs667555), configured as Hisp(pip_config:5.410980e-01). The Z of snp 509: Z=15.50; Hisp Z=5.7; AA=1.2. I would have expected that the snp is not Hisp specific as NHW is quite strong QTL. The all config for snp 509: AA Hisp NHW AA_Hisp AA_NHW 1.257299e-113 5.410980e-01 2.995648e-79 2.089015e-01 4.555976e-81 Hisp_NHW AA_Hisp_NHW 1.803665e-01 6.963401e-02

4) lead QTL in MESuSiE cs when L changes: the 4 top QTL in the gene are in high LD and have pval<1e-100 in NHW, with lead QTL Z>25. When L=5 or L=10, the lead QTL is the single-snp cs L1, with config "AA_Hisp_NHW". When L=20 or L=50, the 4 top QTLs are not in cs, with pip near 0 for all 4. For L=20, the "AA_Hisp_NHW" cs contain no snp in strong LD with lead QTL; for L=50, the "AA_Hisp_NHW" cs contain snp in LD (r=0.875) with lead QTL.

So overall it seems that L=5 or L=10 gives better result in terms of keeping lead QTL signals.

5) test on two pop: we tested on ("AA","NHW"), ("AA","Hisp") and ("Hisp","NHW"). The ("AA","NHW") gives 2 cs when L=10. The (Hisp","NHW") gives 10 cs when L=10; 15 cs when L=20, with L18 max. The ("AA","Hisp") gives 8 cs when L=10, with L10 max; 16 cs when L20, with L20 max; 24 cs when L=50, with L48 max.

So in run with two races, we see stable cs in ("AA","NHW"), but perhaps not ("AA","Hisp") and ("Hisp","NHW").

6) pip_config sum >1: Not in this test gene, but in other genes, we see cases where some snp has the sum of all pip_config more than 1. e.g: AA Hisp NHW AA_Hisp AA_NHW Hisp_NHW 0.0000000 0.5004338 0.0000000 0.7353165 0.0000000 0.1798680 AA_Hisp_NHW 0.2646835 For this snp, the QTL is strong in all pop: NHW Z=22.8; Hisp Z=8.3; AA Z=8.4. Can the sum of all pip_config be more than 1, as in the example the sum is 1.68?

7) strong effect of ancestry_weight on results: we test how the change of ancestry_weight affect the results. We change the ancestry_weight to ancestry_weight=rep(1/7,7) from default. The results are quite different. The new result contains 10 cs with all categorized as "AA_Hisp_NHW" except for two "AA_Hisp", while the default result contains 4 "Hisp", 4 "AA_Hisp", 1 "AA_Hisp_NHW" and 1 "Hisp_NHW". I didn't expect that the prior change would affect the result this much.

The following are not questions per se but just want to confirm the results are ok:

8) snps in more than one cs: there are snps in more than one cs,i.e overlap between cs. e.g: L4(snp 483 486 492) is contained by L9 (snp 467 483 486 492). 9) max_iter on result: We found that max_iter=500 when L=10 gives different result (in terms of number of snps with pip>0.5) from default max_iter=100. When we increase L, the max_iter=500 result is different from max_iter=100. When we set L=5, the max_iter=500 result seems to be same as max_iter=100.

Sorry for the lengthy questions. Thanks a lot for your time and advice on this. I attached a RDat for the test gene (test_gene.MESuSiE_res.RDat.zip), which contains LD_list, summ_stat_list and MESuSiE_res when L=10. We would really like to work out the potential problems. For now, it seems that the two pop of ("AA","NHW") gives the stable result.

test_gene.MESuSiE_res.RDat.zip

Best, Yue

borangao commented 5 months ago

Hi Yue,

Thanks for your continuous interest in MESuSiE. I understand it can be very frustrating when the results are not expected.

For majority of the issue posted here, I think they are all related to the severe LD mismatch for Hispanic population. I would suggest briefly go over the data processing pipeline provided here Data Preparation. You can follow the recent fine-mapping paper like CARMA or SLALOM for how LD mismatch can impact the fine-mapping result.

I would like to go over the questions one by one.

To diagnose the LD mismatch issue, I used the following code in R: load("test_gene.MESuSiE_res.RDat") library(susieR) AA_LD_Check <- kriging_rss(summ_stat_list$AA$Z, LD_list$AA) AA_LD_Check$plot Hisp_LD_Check <- kriging_rss(summ_stat_list$Hisp$Z, LD_list$Hisp) Hisp_LD_Check$plot NHW_LD_Check <- kriging_rss(summ_stat_list$NHW$Z, LD_list$NHW) NHW_LD_Check$plot

The diagnostic plot for AA is here

image

The diagnostic plot for Hisp is here

image

The diagnostic plot for NHW is here

image

You would expect no LD mismatch if all the dots are on the diagonal line like in NHW. We can see there are obvious LD mismatch in AA and Hisp. I tried to do the QC for you, but it seems that there is LD mismatch even after QC I did, especially for Hisp population.

I used the following code to remove the LD mismatch SNPs.

AA_LD_Check <- kriging_rss(summ_stat_list$AA$Z, LD_list$AA)

AA_test_summ <- summ_stat_list$AA AA_test_LD <- LD_list$AA rm_SNP_AA_all<-c() mismatch_num = 1 n_iter = 0 while(mismatch_num > 0) { AA_LD_Check <- kriging_rss(AA_test_summ$Z, AA_test_LD) rm_SNP_AA <- which(AA_LD_Check$conditional_dist$logLR > 2) rm_SNP_AA_all<-c(rm_SNP_AA_all,AA_test_summ$SNP[rm_SNP_AA]) mismatch_num <- length(rm_SNP_AA)
if(mismatch_num > 0) { AA_test_summ <- AA_test_summ[-rm_SNP_AA, ] AA_test_LD <- AA_test_LD[-rm_SNP_AA, -rm_SNP_AA] } n_iter = n_iter + 1 }

After removing mismatched LD for Hispanic

AA_LD_Check$plot

image

AA_LD_Check$plot

Remove Hispannic mismatched SNP

Hisp_test_summ<-summ_stat_list$Hisp Hisp_test_LD<-LD_list$Hisp rm_SNP_Hisp_all<-c() mismatch_num = 1 n_iter = 0 while(mismatch_num>0){ Hisp_LD_Check<-kriging_rss(Hisp_test_summ$Z,Hisp_test_LD) rm_SNP_Hisp<-which(Hisp_LD_Check$conditional_dist$logLR>2) rm_SNP_Hisp_all<-c(rm_SNP_Hisp_all,Hisp_test_summ$SNP[rm_SNP_Hisp]) mismatch_num<-length(rm_SNP_Hisp)
if(mismatch_num>0){ Hisp_test_summ<-Hisp_test_summ[-rm_SNP_Hisp,] Hisp_test_LD<-Hisp_test_LD[-rm_SNP_Hisp,-rm_SNP_Hisp] } n_iter = n_iter + 1 }

Hisp_LD_Check$plot

image

You could clearly see there are clustered SNPs are far off the diagonal line after trying to remove the LD mismatch ( even after 6 iterations).

So I would suggest check the data quality of Hispanic population before moving on.

I try to answer the remaining question by by replacing Hisp by AA in the real data as data quality issue of Hisp population. I used the following code to check the impact of number of effect and order of ancestry group on MESuSiE result. rm_snp<-which(summ_stat_list$AA$SNP%in%rm_SNP_AA_all)

summ_stat_list_qc<-lapply(summ_stat_list,function(x)x[-rm_snp,]) LD_list_qc<-lapply(LD_list,function(x)x[-rm_snp,-rm_snp])

summ_stat_list_qc$Hisp<-summ_stat_list_qc$AA LD_list_qc$Hisp<-LD_list_qc$AA

Number of Effect on result

MESuSiE_rerun<-meSuSie_core(LD_list_qc,summ_stat_list_qc,L=10) MESuSiE_rerun_L20<-meSuSie_core(LD_list_qc,summ_stat_list_qc,L=20)

Iteration on result

MESuSiE_rerun_iter500<-meSuSie_core(LD_list_qc,summ_stat_list_qc,L=10,max_iter = 500)

Switch ancestry order

summ_stat_list_qc_reorder<-list() LD_list_qc_reorder<-list() summ_stat_list_qc_reorder[[1]]<-summ_stat_list_qc$NHW summ_stat_list_qc_reorder[[2]]<-summ_stat_list_qc$AA summ_stat_list_qc_reorder[[3]]<-summ_stat_list_qc$Hisp names(summ_stat_list_qc_reorder)<-c("NHW","AA","Hisp")

LD_list_qc_reorder[[1]]<-LD_list_qc$NHW LD_list_qc_reorder[[2]]<-LD_list_qc$AA LD_list_qc_reorder[[3]]<-LD_list_qc$Hisp names(LD_list_qc_reorder)<-c("NHW","AA","Hisp")

MESuSiE_rerun_L10_reorder<-meSuSie_core(LD_list_qc_reorder,summ_stat_list_qc_reorder,L=10)

It turns out the result is quite consistent with L and ordering of the group. For all the analysis, I found there are two credible sets identified, cs1 "rs58568715", cs 2 "rs1151523" "rs689274" "rs556872" "rs669371" "rs526631" . Both cs are shared across ancestries. image

For question 6 & 8. We also observed such scenarios when there is a LD mismatch and the same SNP is selected in multiple credible set. So again, please check the LD mismatch before running the analysis.

For question 7.

I complete agree ancestry_weight can impact the result of the MESuSiE. In the paper, we strongly suggest to favoring the selection of ancestry-specific effect. An equal weight to shared and ancestry-specific effect may result in favoring selection of shared effect. Though not presented in our manuscript, we found that when ancestry/shared weight is greater than 3, the ROC curve does not change much. But for here using rep(1/7,7) may not be ideal.

Hope that clarifies all questions. Please feel free to ask if you have further question!

Best, Boran

arkyl commented 5 months ago

Hi, Thanks very much for the detailed diagnosis and analysis! We ran the QTL from individual level data and obtained the LD matrix from the same data set. So I guess unlike the "mismatch" from external LD set, these "mismatch" are probably true, caused by complexed LD structure. The Hispanic pop as a mixed ancestry group especially presents a challenge here. I will check our data and test more based on your suggestions. Again thanks a lot for your detailed reply and suggestions!

Best, Yue

borangao commented 5 months ago

Hi Yue,

As mentioned, you retrieved summary statistics and in sample LD. I am very interested in applying MESuSiE in admixture population. I am wondering if it is possible for you following the same strategy employed in Tractor. You can probably get European and African specific GWAS summary statistic and LD from Hispanic like Tractor. You can then apply MESuSiE to it and check the result. That is the research we are very interested in and trying to explore.

Best, Boran

arkyl commented 5 months ago

Hi, Thanks much for your advice! After careful examination of the data, we found the problem was actually from our step of generating LD matrix with matched alleles. We had used a allele-matching plink data set and run ld matrix using plink. However, plink ignored the A1 in .bim but still used minor allele on fly to calculate the ld matrix; so the sign of ld for some snps were flipped. After we corrected this problem for the test gene, every pop, NHW, AA and Hisp has no problem in LD mismatch check; they all follow a narrow diagonal line in kriging plot. The result of MESuSiE was stable, two cs when L=10, with one cs containing lead QTL. I think the result looks quite good to us. Just one thing to confirm; we found that all the 985 snps except for 4, had identical pip (0.001015228). The 4 snps of exception include the snps in cs, so that's where is interesting; but just want to confirm that this pip pattern is ok:

MESuSiE_res$pip[1:5] [1] 0.001015228 0.001015228 0.001015228 0.001015228 0.001015228 which(abs(MESuSiE_res$pip-0.001015228)>1e-06) [1] 423 494 501 547 MESuSiE_res$pip[c(423,547)] [1] 0.001138107 0.047236120 MESuSiE_res$pip[c(494,501)] [1] 0.9527613 0.9701777

Best, Yue

borangao commented 5 months ago

Looks good to me. The prior probability of a SNP being causal is 1/985 = 0.001015228. Currently we are reporting 95% credible set. If you further set it to 99% credible set, then all the 4 SNPs will be included in the cs. Let me know if this clarifies.

Best, Boran

arkyl commented 5 months ago

Great. Thanks for the explanation! Please close the thread.