chr1swallace / coloc

Repo for the R package coloc
138 stars 44 forks source link

prepare the data structure for coloc.abf and loop for multiple variants running together #118

Open manL23 opened 1 year ago

manL23 commented 1 year ago

Hi, Thanks for creating this package. I tried to run colocalization with GWAS sumstats and mQTL data extracted from GoDMC dataset. However, there're some issues about the data structure to run coloc.abf with multiple snps together in a loop. Here're the details of my datasets and problematic outputs from coloc.abf

GWAS summary statistics as data frame

'data.frame': 41882 obs. of 8 variables: $ snp : chr "rs573419147" "rs557642992" "rs7454868" "rs574909700" ... $ chr : int 17 17 6 6 6 6 6 6 6 6 ... $ position : int 44327370 44365374 26799828 28832788 28833101 28840885 28843311 28958399 28961100 29266673 ... $ maf.gwas : num 0.787 0.769 0.914 0.922 0.929 ... $ beta.gwas : num 1.07 1.07 1.25 1.25 1.28 ... $ SE : num 0.0149 0.0142 0.0205 0.0173 0.0179 ... $ p.gwas : num 8.07e-06 8.11e-06 2.80e-27 3.86e-38 1.96e-42 ... $ varbeta.gwas: num 0.000223 0.000201 0.000422 0.000299 0.000322 ...

mQTL data from GoDMC

data.frame': 13234 obs. of 8 variables: $ snp.mqtl : chr "rs7074491" "rs7534848" "rs7545236" "rs1783551" ... $ name : chr "chr10:134813720:SNP" "chr1:169530093:SNP" "chr1:169530070:SNP" "chr11:75231212:SNP" ... $ cpg : chr "cg00320980" "cg16054275" "cg16054275" "cg15526825" ... $ beta.mqtl : num -0.349 0.411 0.411 0.496 0.45 ... $ se.mqtl : num 0.0093 0.011 0.011 0.0132 0.012 ... $ samplesize : num 21983 27748 27748 25694 26642 ... $ p.mqtl : num 2.38e-308 3.17e-308 3.17e-308 3.71e-308 4.17e-308 ... $ varbeta.mqtl: num 8.64e-05 1.20e-04 1.20e-04 1.75e-04 1.44e-04 ...

• Merge two datasets into a single one

• make lists of lists as required by coloc.abf () without loop function

D1= list(beta=df$beta.gwas, varbeta=df$varbeta.gwas, MAF=df$maf.gwas, type="cc", s=40675/105318, N=105318, snp=df$snp)

The D1 looks as below:

List of 7 $ beta : num [1:44] 1.12 1.05 1.12 1.13 1.12 ... $ varbeta: num [1:44] 1.45e-04 9.29e-05 1.47e-04 1.45e-04 1.47e-04 ... $ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ... $ type : chr "cc" $ s : num 0.386 $ N : num 105318 $ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935"

D2 =list(beta=df$beta.mqtl, varbeta=df$varbeta.mqtl, MAF=df$maf.gwas, type="quant", N=27750, snp=df$snp)

List of 6 $ beta : num [1:44] 0.1109 -0.0869 0.1123 0.1096 0.1095 ... $ varbeta: num [1:44] 1.13e-04 7.64e-05 1.09e-04 1.10e-04 1.11e-04 ... $ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ... $ type : chr "quant" $ N : num 27750 $ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935" .

coloc.abf(D1,D2)

nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 44 0.000000e+00 6.668867e-72 0.000000e+00 1.000000e+00 3.177685e-52 However, I want to run the analysis on each SNP, and gives ~ different results per SNPs. So I wrote a loop as below Set datalists with loop for coloc.abf () function datalist1<- lapply(1:length(df$snp), function(i) { d1[[i]] = list(snp =df$snp[i],beta=df$beta.gwas[i],varbeta=df$varbeta.gwas[i], MAF=df$maf.gwas[i], type="cc", s=40675/105318, N=105318) })

datalist2 <- lapply(1:length(df$snp), function(i) { d2[[i]] = list(snp =df$snp[i],beta=df$beta.mqtl[i],varbeta=df$varbeta.mqtl[i], MAF=df$maf.gwas[i], type="quant", N=27750) })

res <- lapply(1:length(df$snp),function(i) coloc.abf(datalist1[[i]],datalist2[[i]])$summary) do.call("rbind", res)

However, it seems that the code only run coloc with the same SNP per row

  nsnps PP.H0.abf    PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf

[1,] 1 0 4.108072e-22 0 0 1 [2,] 1 0 7.421940e-20 0 0 1 [3,] 1 0 1.642475e-23 0 0 1 [4,] 1 0 3.604846e-22 0 0 1 [5,] 1 0 6.957833e-22 0 0 1 ....... [40,] 1 0 1.937698e-23 0 0 1 [41,] 1 0 1.613592e-23 0 0 1 [42,] 1 0 1.571302e-23 0 0 1 [43,] 1 0 1.703198e-23 0 0 1 [44,] 1 0 7.755652e-23 0 0 1

Not sure if the issue cause by the data structures (it seemed fine as check_dataset function returned NULL) before running coloc.abf, or the loop function. Could you please provide more details about how to prepare the right datalists to coloc.abf for multiple SNPs and how to run multiple variants together? Many thanks!