Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
sumstatsColHeaders duplicates #148

Closed svenkatesh25 closed 1 year ago

svenkatesh25 commented 1 year ago

1. Bug description

I'm finding duplicate entries in the sumstatsColHeaders file. Most of these duplicates map to a single Corrected value, but I notice "MAJORALLELE" maps to both A1 and A2. Is this intended?

Al-Murphy commented 1 year ago

Hey!

The data frame sumstatsColHeaders intentionally contains duplicates. It is used to map the column headers of raw input sumstats (values in Uncorrected) to a standardised value (Corrected), so multiple Uncorrected values map to one Corrected value in order to catch all possibilities. Note that if your sumstats contain two Uncorrected headers that map to the same Corrected value, the order in which they appear in sumstatsColHeaders matters: the first one to appear will be used for that column. You can modify either the sumstats or sumstatsColHeaders to account for this.
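To make the "first appearing one wins" rule concrete, here is a minimal Python sketch of that kind of ordered lookup. It is not the package's actual implementation: the function name `standardise_headers`, the upper-casing step, and every mapping row are illustrative assumptions, apart from MAJORALLELE mapping to both A1 and A2, which comes from the question above.

```python
# Illustrative (Uncorrected, Corrected) pairs. These rows are assumptions made
# for this sketch and are NOT copied from the real sumstatsColHeaders table.
COL_HEADERS = [
    ("PVAL", "P"),
    ("P_VALUE", "P"),
    ("RSID", "SNP"),
    ("MAJORALLELE", "A1"),
    ("MAJORALLELE", "A2"),
]


def standardise_headers(header, mapping=COL_HEADERS):
    """Rename raw column names, walking the mapping rows in table order.

    A row is applied only if its Uncorrected name is present and its
    Corrected name is not already in use, so the earliest matching row wins.
    """
    names = [h.upper() for h in header]  # assumption: matching ignores case
    for uncorrected, corrected in mapping:
        if uncorrected in names and corrected not in names:
            names[names.index(uncorrected)] = corrected
    return names


# Both P_VALUE and PVAL map to P; PVAL is listed first in the table, so it is
# the one renamed, regardless of the column order in the input.
print(standardise_headers(["rsid", "P_VALUE", "PVAL"]))

# A1 already exists, so MAJORALLELE falls through to the A2 row.
print(standardise_headers(["A1", "MAJORALLELE"]))
```

In this sketch, reordering the rows of the mapping table (or renaming a column in the input beforehand) changes which column ends up as P, which is the sense in which either the sumstats or the table can be modified.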

MAJORALLELE relates to the reference and alternative allele information, which is somewhat of an anomaly: some people interpret A1 as the major (reference) allele, whereas others interpret A2 that way. MSS tries to account for this by including both mappings. In either case, we run an allele flip check, using a reference dataset to check that our interpretation of A1 and A2 (A1 being the ref and A2 the alt) is consistent for all SNPs in the sumstats. For any SNPs where it isn't, their A1 and A2 values are swapped, any effect columns are flipped and the FRQ value is reversed. These checks should cover any naming conventions people use. However, if you think this is causing issues in your sumstats, do share example code and a dataset showing it and I can update MSS.
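The flip check described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not MSS code: `flip_alleles`, the `ref_alleles` lookup, and the SNP IDs are hypothetical, BETA stands in for "any effect columns" and is flipped by negating it, and "reversed" FRQ is taken to mean 1 - FRQ.

```python
def flip_alleles(rows, ref_alleles):
    """Return copies of rows with A1/A2 swapped wherever A2 is the ref allele.

    rows: list of dicts with keys SNP, A1, A2, BETA, FRQ.
    ref_alleles: hypothetical lookup from SNP ID to the reference allele,
    standing in for the reference dataset.
    """
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        ref = ref_alleles.get(row["SNP"])
        # Flip only when the reference allele sits in A2 instead of A1.
        if ref is not None and row["A1"] != ref and row["A2"] == ref:
            row["A1"], row["A2"] = row["A2"], row["A1"]
            row["BETA"] = -row["BETA"]   # assumption: effect flipped by sign
            row["FRQ"] = 1 - row["FRQ"]  # assumption: frequency of other allele
        fixed.append(row)
    return fixed


# Made-up SNP IDs and values, for illustration only.
sumstats = [
    {"SNP": "rs1", "A1": "A", "A2": "G", "BETA": 0.5, "FRQ": 0.25},
    {"SNP": "rs2", "A1": "T", "A2": "C", "BETA": -0.2, "FRQ": 0.75},
]
reference = {"rs1": "A", "rs2": "C"}

for row in flip_alleles(sumstats, reference):
    print(row)
```

Here rs1 already has the reference allele in A1 and is left alone, while rs2 has it in A2 and is flipped. SNPs missing from the reference, or matching neither allele, are passed through unchanged in this sketch.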

Cheers, Alan.

Al-Murphy commented 1 year ago

Closing, please feel free to reopen if this didn't answer your question.

Alan.