Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

When `allele_flip_check = TRUE`, allele frequency should also be flipped #44

Closed albert-ying closed 3 years ago

albert-ying commented 3 years ago

Thank you for this wonderful tool. I noticed that when using allele_flip_check = TRUE, A1/A2 and BETA are flipped according to the reference, but the FRQ column is unchanged. Actually, I'm not clear on what the FRQ column is. Based on the manual it should be "The minor allele frequency (MAF) of the SNP", yet it is sometimes larger than 0.5. It seems to simply copy the frequency column from the raw file without changing anything.

Some software (e.g., GCTA) requires FRQ to be the effect allele frequency. Therefore FRQ should be set to the frequency of A1 by default and flipped if needed, especially when an EAF column is detected in the raw file.

Al-Murphy commented 3 years ago

Hi Albert, thanks for bringing this to my attention. I believe this was an oversight, and I agree that, in certain instances, the FRQ column should be flipped if the A1 and A2 of the SNP in question are flipped.

A note on the logic, however: MungeSumstats (1.1.5+) enforces a rule that the effect allele is always the A2 allele, which is also the approach used for VCF. Flipping ensures that A1 is the reference allele based on the reference genome.
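To illustrate the rule just described, here is a minimal sketch in Python. It is not the package's R implementation: the function name `flip_to_reference` and the dict layout are hypothetical, and it covers only the simple case where one of the two alleles matches the reference base.

```python
def flip_to_reference(snp, ref_allele):
    """Return a copy of `snp` in which A1 is the reference allele.

    `snp` is a dict with keys "A1", "A2" and "BETA" (a hypothetical layout
    chosen for this sketch). If A2 matches the reference instead of A1,
    the two alleles are swapped and the sign of BETA is reversed, so that
    BETA still describes the effect of whichever allele is now A2.
    """
    out = dict(snp)
    if snp["A1"] != ref_allele and snp["A2"] == ref_allele:
        out["A1"], out["A2"] = snp["A2"], snp["A1"]
        out["BETA"] = -snp["BETA"]
    return out
```

A SNP whose A1 already matches the reference is returned unchanged.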

The FRQ column can be the effect allele frequency as you describe, but there are other use cases for the column, and we want our solution to accommodate them so the package remains usable in most instances. Therefore, we do not infer whether the FRQ column relates to A1 or A2; instead, we flip FRQ only for SNPs whose A1 and A2 values need to be flipped. Note that there is no enforcement that the FRQ value is less than 0.5; the value is taken from the user's input.

In v1.1.6+ I have implemented flipping of the FRQ column, controlled by the allele_flip_frq input parameter to format_sumstats(), which is TRUE by default. Note, though, that we can only flip the frequency for bi-allelic SNPs: with more than two alleles, flipping (i.e. taking 1 minus the inputted value) would not work. So to allow flipping of FRQ you must have bi_allelic_filter set to TRUE, which is also the default. MungeSumstats will throw an error if you have bi_allelic_filter=FALSE but allele_flip_frq=TRUE.
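The behaviour described above can be sketched in Python as follows. This is only an illustration of the logic, not the package's R code: the function `flip_frq` is hypothetical, and its keyword arguments simply mirror the format_sumstats() parameter names mentioned in this comment.

```python
def flip_frq(frq, was_flipped, bi_allelic_filter=True, allele_flip_frq=True):
    """Return the frequency to store for one SNP after the allele flip check.

    Hypothetical helper. For a bi-allelic SNP the two allele frequencies
    sum to 1, so swapping A1 and A2 turns a frequency f into 1 - f.
    With more than two alleles that identity does not hold, so the
    combination allele_flip_frq=True, bi_allelic_filter=False is rejected.
    """
    if allele_flip_frq and not bi_allelic_filter:
        raise ValueError(
            "allele_flip_frq=True requires bi_allelic_filter=True"
        )
    if allele_flip_frq and was_flipped:
        return 1.0 - frq
    return frq
```

For example, a flipped SNP with an input frequency of 0.25 would be stored as 0.75, while a SNP that was not flipped keeps its input value.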

A note on why the FRQ value can be greater than 0.5, which hopefully won't confuse things further: the reference genome (especially hg19, but at times hg38 too) can contain minor alleles. hg19 / GRCh37 was used for more than a decade as the primary reference genome, yet ~70% of its sequence was based on a single individual from the Buffalo area, New York, USA. This individual carried many thousands of rare disease susceptibility alleles. More on this here: "The Reference Human Genome Demonstrates High Risk of Type 1 Diabetes and Other Disorders".

Al-Murphy commented 3 years ago

v1.1.7 has been pushed to GitHub with this functionality in place. The functionality will propagate to Bioconductor in the next release, around September.

albert-ying commented 3 years ago

@Al-Murphy It seems that instead of using 1 - freq, it gives -freq when flipped. [image]

Al-Murphy commented 3 years ago

Apologies, I have fixed this in v1.1.8.