Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Problem in running MungeSumstats::format_sumstats with bi_allelic_filter = FALSE #137

Closed rootze closed 1 year ago

rootze commented 1 year ago

Thanks for developing this awesome tool. I have a question regarding dropping a large number of SNPs in GWAS summary statistics. I used the AD GWAS (PubMed ID: 35379992) https://www.ebi.ac.uk/gwas/publications/35379992

Version of MungeSumstats -- MungeSumstats_1.5.18

AD_NG2022_GWAS_Munged <- MungeSumstats::format_sumstats(
                                                        path = AD_NG2022_GWAS,  
                                                        ref_genome="GRCh38", 
                                                        save_path = paste0(AD_NG2022_PATH, 'formatted/', 'AD_NG2022_GWAS_Munged_12122022.txt'), 
                                                        bi_allelic_filter = FALSE)

When running with bi_allelic_filter = FALSE as something I can try after reading: Issue Large number of non-biallelic SNPs #111, I get an error as follow:

Error in check_allele_flip(sumstats_dt = sumstats_return$sumstats_dt,  :
  Certain SNPs need to be flipped along with their effect columns and frequency column. However, to flip the
FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so
non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note
this could lead to incorrect FRQ values.

With the above error suggested setting allele_flip_frq to FALSE. (with bi_allelic_filter = FALSE, allele_flip_drop = FALSE,)

AD_NG2022_GWAS_Munged <- MungeSumstats::format_sumstats(path = AD_NG2022_GWAS, ref_genome="GRCh38", save_path = paste0(AD_NG2022_PATH, 'formatted/', 'AD_NG2022_GWAS_Munged_12122022.txt'), bi_allelic_filter = FALSE, allele_flip_drop = FALSE, allele_flip_frq = FALSE)
Summary statistics report:
   - 19,779,968 rows (93.7% of original 21,101,114 rows)
   - 19,741,483 unique variants
   - 5,418 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 16.921 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.:

Although 93.7% of original SNPs were kept, the FRQ is not all MAF, as I understand anymore. I thought MAF should be minor allele frequency. But the following message suggested there are a lot of FRQ with values > 0.5. I am a bit confused right now and what I should use.

Warning: When method is an integer, must be >0.
8,476,457 SNPs (42.9%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.

Just giving it a try, when I set the bi_allelic_filter = TRUE and I also set allele_flip_drop = FALSE About 40% of the SNPs were dropped, including some lead SNPs.

Summary statistics report:
    - 13,051,151 rows (61.9% of original 21,101,114 rows)
    - 13,051,151 unique variants
    - 2,975 genome-wide significant variants (P<5e-8)
    - 22 chromosomes
 Done munging in 27.315 minutes.
 Successfully finished preparing sumstats file, preview:

I am in a dilemma. And would be much appreciated your suggestion. Thank you!

Al-Murphy commented 1 year ago

Hey!

Great to hear you are finding the package useful!

Firstly, I would highly recommend installing R 4.2 and Bioconductor 3.16 to install MSS >=v1.6.0. This version has far more functionality than 1.5 that you are using - including a choice of dbSNP versions so you can use the latest version (155) as well as 144.

On your specific problem - the issue you are seeing with check_allele_flip is as expected, we can't flip the effect of a SNP if the direction is incorrect if there is another alternative allele i.e. 1-effect would not be correct in this instance. You correctly state to use allele_flip_frq to avoid this error which will stop this check and leave flip but like you say then the MAF value may not be correct. Another option would be to use allele_flip_drop which will drop any SNPs that are non-bi-allelic and that need to be flipped. I would suggest this approach if you need to use the MAF values downstream.

A broader question is to whether you want to remove non-bi-allelic SNPs all together. This is discussed more here but in short, general advice for downstream analysis is to remove them. Whether the large drop in remaining SNPs will make a difference in downstream analysis is something we are currently testing.

Hope this helps, Alan.

rootze commented 1 year ago

@Al-Murphy Thank you for your help. That made sense.