LieberInstitute / 10xPilot_snRNAseq-human

10x snRNA-seq study on 5 postmortem human brain regions across the reward circuitry: NAc, AMY, sACC, DLPFC, and HPC
21 stars 9 forks source link

Check overlap with GWAS snps #1

Closed lcolladotor closed 4 years ago

lcolladotor commented 4 years ago

Before proceeding with the plink bed file created in https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/filter_data/filter_snps.R#L51-L53, it might be useful to check manually how many SNPs overlap those in a GWAS like https://nealelab.github.io/UKBB_ldsc/h2_summary_20116_0.html. Use https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/filter_data/filter_snps.R#L62 for the map between hg19 and hg38 coordinates. See in https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/psycm/convert_to_hg38.R how some of these checks were done for the brainseq phase 2 SNPs and the CLOZUK SCZD GWAS.

Potentially it's best to compute the TWAS weights with more SNPs even if they are absent on the GWAS side. From https://github.com/LieberInstitute/fusion_twas/blob/jhpce/FUSION.assoc_test.R#L175-L178 my understanding is that if the GWAS side is missing SNPs present in the weights, then they are labeled as NAs. So we might be ok to go with the output from https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/filter_data/filter_snps.R#L51-L53, but prior to the compute-intensive step of the TWAS weights computation it's best to have a general idea of how overlap there is.

Save your comparison code under the twas/filter_data directory or create one like twas/gwas_check or something like that.

aseyedia commented 4 years ago

Using convert_to_hg38.R, I've managed to generate some initial observations:

# this command returns false
identical(length(unique(ids)), length(ids))
[1] FALSE

> table(duplicated(ids))
   FALSE     TRUE
13773232    18235

# approx 55k snps are duplicates
> length(unique(ids_snpmap))
[1] 10,933,669
> length(ids_snpmap)
[1] 10,988,411

> table(gwas_test$variant %in% snpMap$snp)
   FALSE     TRUE
13352549   438918

table(ids %in% ids_snpmap)
# > table(ids %in% ids_snpmap)
#
#   FALSE    TRUE
# 5627759 8163708

As you can see with table(gwas_test$variant %in% snpMap$snp), the overlap between snpMap and the GWAS set you linked me was incredibly small (TRUE/FALSE = 0.03287147644) but that seems due to a difference in formatting (the snpMap has RSIDs). table(ids %in% ids_snpmap) yielded what is probably a more reliable result because both columns are formatted ad hoc (TRUE/FALSE ~ 1.5).

I have saved the code under twas/gwas_check.R.