High percentage of A1 not matching the reference genome

lpgilchrist commented 2 years ago

Hi,

First thanks for the amazing R package! It's a great way to ensure standardisation across multiple GWAS summary statistics and the sort of thing I have been looking for for quite a while.

I am currently running the format_sumstats() function chr by chr and have noticed I am getting a high percentage of SNPs where A1 does not match the reference genome, even though I know this is the reference allele and that the other allele – set as A2 for the purposes of MungeSumstats – is the effect allele.

This seems a bit unusual so just wanted to check.

In this case of 750,404 SNPs, A1 for 661,360 do not match the reference genome on chr 1.

I have included output from running on chr 1.

Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  EAF BETA    SE  INFO    N   P   
Summary statistics report:
   - 750,404 rows
   - 749,014 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
6 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
38,783 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  EAF BETA    SE  N   P   
Summary statistics report:
   - 750,404 rows
   - 749,014 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
6 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
38,783 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Found 50186 Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 661,360 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.

Thanks,

Lachlan

Al-Murphy commented 2 years ago

Hey Lachlan,

Glad you are finding it useful. I'll need a few things to try and understand the issue, can you give me the version of MungeSumstats you are using? I will also need a small example dataset with code where you show how this is occurring and why you wouldn't expect it to happen for these SNPs. Think of it as similar to posts on stack overflow. Also you can use the log_folder_ind parameter to save all the filtered out SNPs into txt files split based on the reason they were filtered out. Also use the imputation_ind to show where data has been changed for a SNP. The later here can be used to show where the direction of a SNP has been flipped. You can use this to filter to the 661,360 SNPs where the A1 didn't match the reference that MSS has flipped.

Thanks, Alan.

Al-Murphy commented 2 years ago

Closing for now, if you do get the information above I need, feel free to re-open.

Thanks, Alan.

Al-Murphy / MungeSumstats

High percentage of A1 not matching the reference genome #108