Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Allele frequency can't be flipped for multi-allelic variants. #165

Closed kousathanas closed 1 year ago

kousathanas commented 1 year ago

Hi, when running MungeSumstats v1.9.6, I got the following error:

Error in check_allele_flip(sumstats_dt = sumstats_return$sumstats_dt, : Certain SNPs need to be flipped along with their effect columns and frequency column. However to flip the FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note this could lead to incorrect FRQ values.

With ever increasing sample sizes, the majority of positions in the genome will have (rare) multi-allelic variants. It is theoretically possible that with large enough sample sizes, all possible mutations for every single position in the genome will be detected and added to dbSNP. Thus, only keeping bi-allelics (as defined through dbSNP) is not really a viable option: more than half of any dataset -eventually the entirety- will be eliminated by such a filter.

In this context, I would like to know how the above error can be sensibly bypassed while flipping columns. Is the solution to do this procedure manually?

I would be glad of any info or feedback on the above issue/error.

best, Thanos

Al-Murphy commented 1 year ago

Hey!

> With ever increasing sample sizes, the majority of positions in the genome will have (rare) multi-allelic variants. It is theoretically possible that with large enough sample sizes, all possible mutations for every single position in the genome will be detected and added to dbSNP. Thus, only keeping bi-allelics (as defined through dbSNP) is not really a viable option: more than half of any dataset -eventually the entirety- will be eliminated by such a filter.

I completely agree. This is something we are actively investigating in the lab, looking at the number of non-bi-allelic SNPs across different dbSNP builds. For example, see this issue. I do believe we will be heading towards keeping non-bi-allelic SNPs as the default, but this requires checking what effect it will have, for example on commonly used downstream analysis tools, which currently mostly expect bi-allelic SNPs only.

> In this context, I would like to know how the above error can be sensibly bypassed while flipping columns. Is the solution to do this procedure manually?

Flipping the frequency of non-bi-allelic SNPs is a hard problem to get around, since we would need to know the frequency of the other alternative allele in the same population that the study was conducted in. We could possibly get the user to specify a population, say European, and use reference databases for it, but this would not be accurate for the specific study, and these databases don't currently exist in this form in R. Consider dbSNP, which has started to capture this information: where a second alternative allele is found in an African population but has never been seen in a European one, the SNP is added to dbSNP with a frequency value for Africa only. This could be used to flip frequency more accurately, but it is not perfect and would require the R versions of dbSNP releases to hold frequency data, which they currently don't.
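To make the difficulty concrete, here is a small arithmetic sketch in Python. It is not MungeSumstats code, and the allele frequencies are made-up illustrative numbers. It shows that `1 - p` recovers the other allele's frequency at a bi-allelic site, but at a multi-allelic site it overshoots by the combined frequency of the remaining alleles.

```python
def flip_frequency_biallelic(frq):
    """Frequency of the other allele, valid only when exactly two alleles exist."""
    return 1.0 - frq

# Hypothetical bi-allelic site: the two frequencies sum to 1.
biallelic = {"A": 0.70, "G": 0.30}
flipped = flip_frequency_biallelic(biallelic["G"])  # matches the frequency of A

# Hypothetical tri-allelic site: a rare third allele T is present.
triallelic = {"A": 0.70, "G": 0.25, "T": 0.05}
naive = flip_frequency_biallelic(triallelic["G"])   # 0.75, not the 0.70 of A
error = naive - triallelic["A"]                     # about 0.05, the frequency of T

print(f"bi-allelic flip:  {flipped:.2f}")
print(f"tri-allelic flip: {naive:.2f} (true value {triallelic['A']:.2f}, error {error:.2f})")
```

The size of the error is exactly the frequency of the alleles that are ignored, which is why the correct flip needs per-allele frequencies from the study population.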

My advice to sensibly deal with this is to set allele_flip_frq = FALSE and then not use the FRQ data, as these will not have been flipped (all other effect columns will have been flipped for the necessary SNPs). Alternatively, set allele_flip_frq = FALSE and also set imputation_ind = TRUE, then, for the SNPs which have been flipped, manually flip the bi-allelic ones and set the non-bi-allelic SNPs to a sensible value, perhaps NA. Again, the second approach is only necessary if you need the frequency column for downstream analysis.
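The second approach can be sketched roughly as follows. This is a minimal Python illustration of the logic only, not MungeSumstats code. The field names `flipped` and `bi_allelic` are hypothetical stand-ins for however flipped and bi-allelic SNPs are identified in your own output, and Python's `None` plays the role of NA.

```python
def fix_frq_after_flip(rows):
    """Return copies of rows with FRQ corrected for SNPs whose alleles were flipped.

    Assumes each row is a dict with hypothetical keys:
      FRQ        - frequency as reported (not yet flipped)
      flipped    - True if the SNP's alleles and effect columns were flipped
      bi_allelic - True if the SNP has exactly two known alleles
    """
    fixed = []
    for row in rows:
        new_row = dict(row)  # copy so the input is left untouched
        if row["flipped"]:
            if row["bi_allelic"]:
                new_row["FRQ"] = 1.0 - row["FRQ"]  # safe: only two alleles
            else:
                new_row["FRQ"] = None  # unknown without per-allele frequencies
        fixed.append(new_row)
    return fixed

# Made-up example rows.
rows = [
    {"SNP": "rs1", "FRQ": 0.20, "flipped": True, "bi_allelic": True},
    {"SNP": "rs2", "FRQ": 0.40, "flipped": True, "bi_allelic": False},
    {"SNP": "rs3", "FRQ": 0.10, "flipped": False, "bi_allelic": False},
]
result = fix_frq_after_flip(rows)
for r in result:
    print(r["SNP"], r["FRQ"])
```

SNPs that were not flipped keep their reported frequency regardless of how many alleles they have, since their FRQ already refers to the right allele.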

I'm open to suggestions if you think MSS could be modified to better deal with your issue, just let me know.

Alan.

kousathanas commented 1 year ago

Hi @Al-Murphy

Thank you for the prompt reply. I agree that the problem of flipping allele frequencies for multi-allelics is not trivial.

Three comments:

In this context, I believe that an easy solution for MungeSumstats would be to add an option flip_frq_as_biallelic, which would flip allele frequencies as if the variant were bi-allelic, i.e., 1 - p. This could be set to FALSE by default. The user can then choose to QC out variants whose allele frequencies are inconsistent with their reference population of choice (e.g., gnomAD), which will eliminate any errors introduced in this way. It's not a perfect solution, but you can provide a warning when it is activated. As the vast majority of multi-allelics are very rare, this will save a large fraction of variants that are effectively bi-allelic.
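A rough sketch of this proposal in Python, covering both the bi-allelic-style flip and the downstream QC step. This is not MungeSumstats code. The `needs_flip` field, the `reference_frq` lookup and the `max_diff` threshold of 0.1 are all assumptions chosen for illustration, and variants with no reference frequency are simply kept.

```python
def flip_and_qc(variants, reference_frq, max_diff=0.1):
    """Flip FRQ as 1 - p where needed, then drop variants far from a reference.

    variants      - list of dicts with hypothetical keys SNP, FRQ, needs_flip
    reference_frq - dict mapping SNP id to a reference-population frequency
    max_diff      - arbitrary illustrative tolerance on the absolute difference
    """
    kept, dropped = [], []
    for v in variants:
        # Treat every variant as bi-allelic when flipping.
        frq = 1.0 - v["FRQ"] if v["needs_flip"] else v["FRQ"]
        ref = reference_frq.get(v["SNP"])
        if ref is not None and abs(frq - ref) > max_diff:
            dropped.append(v["SNP"])  # inconsistent with the reference
        else:
            kept.append({**v, "FRQ": frq, "needs_flip": False})
    return kept, dropped

# Made-up example data.
variants = [
    {"SNP": "rs1", "FRQ": 0.30, "needs_flip": True},   # flips to 0.70
    {"SNP": "rs2", "FRQ": 0.40, "needs_flip": True},   # flips to 0.60
    {"SNP": "rs3", "FRQ": 0.10, "needs_flip": False},  # no reference entry
]
reference = {"rs1": 0.72, "rs2": 0.35}

kept, dropped = flip_and_qc(variants, reference)
print("kept:", [v["SNP"] for v in kept])
print("dropped:", dropped)
```

A rare third allele shifts the flipped frequency only slightly, so such variants pass the check, while variants whose flipped value is badly off are caught by the comparison.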

best, Thanos

Al-Murphy commented 1 year ago

Hey,

> In this context, I believe that an easy solution for MungeSumstats would be to add an option flip_frq_as_biallelic, which would flip allele frequencies as if the variant were bi-allelic, i.e., 1 - p. This could be set to FALSE by default. The user can then choose to QC out variants whose allele frequencies are inconsistent with their reference population of choice (e.g., gnomAD), which will eliminate any errors introduced in this way. It's not a perfect solution, but you can provide a warning when it is activated. As the vast majority of multi-allelics are very rare, this will save a large fraction of variants that are effectively bi-allelic.

Yes, I agree with this approach as long as it isn't set as the default. I have updated MSS to incorporate this (v1.9.16). Could you test it and let me know if it works as intended for you?

Cheers, Alan.

Al-Murphy commented 1 year ago

Closing because of inactivity. Reopen if the issue isn't resolved for you.

Alan.