Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Inquiry on Converting MAF to EAF Using MungeSumstats #189

Open DaXuanGarden opened 2 months ago

DaXuanGarden commented 2 months ago

I hope this message finds you well. I am currently working on a project involving GWAS summary statistics. Your MungeSumstats package has been incredibly useful for standardizing these datasets, and I appreciate the effort your team has put into developing it.

I have encountered a specific challenge related to the conversion of minor allele frequencies (MAF) to effect allele frequencies (EAF) in my dataset. Given the diverse nature of the summary statistics files and the potential discrepancies in allele definitions, I am seeking your guidance on how best to handle this conversion within the framework of your package.

Could you please advise on the following:

  1. Standard Approach: Does MungeSumstats provide any built-in functionality or recommended practices for converting MAF to EAF, particularly when the effect and minor alleles differ?

  2. Handling Allele Discrepancies: How does the package address discrepancies between the effect allele and the reference allele in standard GWAS formats? Are there specific functions or steps you would recommend for ensuring accurate EAF computation?

  3. Integration with Other Packages: Is there any synergy with other packages, such as gwasvcf or ieugwasr, that can facilitate this conversion process while maintaining data integrity?

  4. User Customization: If manual adjustments are needed for unique cases, what is your advice on implementing these within MungeSumstats without compromising the package's standardization strengths?

I appreciate any insights you can provide and look forward to your expert advice on these matters. Thank you for your time and for creating a tool that significantly enhances data analysis in genetic research.

Best regards,

Daxuan

Al-Murphy commented 2 months ago

Hey Daxuan,

Issues around minor allele frequencies (MAF) and effect allele frequencies (EAF) are quite common in the field down to differing of definitions. This comes back to the mess around differing meanings behind alleles such as A1/A2, effect/non-effect and minor/major alleles. We try to handle all cases in MSS but since this can be quite ambiguous, there are cases where we have to assume the effect and frequency columns relate to the minor allele (i.e. not the allele on the reference genome) to enable a standardised formatting. See infer_effect_column and my answer here for more info on this.

I'll try my best to clarify the problem and will explain how MSS deals with it. Overall though, it is good to set up some test summary stats to check if MSS behaves as you expect (as I do below) or to consult our documentation and related publication.

Integration with Other Packages: Is there any synergy with other packages, such as gwasvcf or ieugwasr, that can facilitate this conversion process while maintaining data integrity?

Yes, so we actually align our output with both in that we process so: "The effect or alternative allele is always assumed to be the second allele (A2), in line with the approach for GWAS-VCF (Lyon et al., 2021)" - from our publication.

User Customization: If manual adjustments are needed for unique cases, what is your advice on implementing these within MungeSumstats without compromising the package's standardization strengths?

It depends, the defaults are sensible but the other parameter choices are there to make it flexible to the users needs without compromising the standardisation principles of MSS. Have a look at the parameters for format_sumstats function and feel free to ask a specific use case but I can't really answer such a vague question without more input from you on what you are trying to do.

Handling Allele Discrepancies: How does the package address discrepancies between the effect allele and the reference allele in standard GWAS formats? Are there specific functions or steps you would recommend for ensuring accurate EAF computation?

MSS has quite a robust approach to figuring out the effect from the non-effect allele with multiple checks to interpret the column headers inputted and also using reference genomes. Read through our documentation and come back to me with any specific questions.

Standard Approach: Does MungeSumstats provide any built-in functionality or recommended practices for converting MAF to EAF, particularly when the effect and minor alleles differ?

So there are quite a few things going on in your question here. Firstly, how would you know to convert MAF to EAF? The minor allele is more commonly the effect allele - think in a GWAS you would be measuring the effect of the SNP that isn't on the reference genome. I know it can happen that this isn't the case though which MSS will try to infer as discussed in my first paragraph. Secondly, you can only change the noted MAF by flipping it which should only be done if the SNP is Bi-allelic i.e. there is only one known SNP in the population so it can be flipped with 1-MAF. Note, MSS does check for this and will flip MAFs at a SNP level if they look different to the rest in the sumstats. Hzve a look at our documentation for more info on this.

And just an example of how to test how MSS would handle certain cases. Imagine we have a sumstats that looks like this:

ss <- data.table("SNP"="rs1182172","CHR"=7,"BP"=2838469,"A1"="G", "A2"="A", 
                 "FRQ"=0.361569,"BETA"=-0.0453284, "SE"=0.00988289, 
                 "P"=4.5063e-06)

Here A2 is the effect allele and FRQ/Beta relates to it. If we run MSS, it will pick up on this:

> formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c034ee041f.tsv.gz
Infer Effect Column
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Allele columns are ambiguous, attempting to infer direction
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 4 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Effect/frq column(s) relate to A2 in the inputted sumstats
Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Summary statistics report:
   - 1 rows
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Inferred genome build: GRCH38
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Coercing BP column to numeric.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c034ee041f.tsv.gz
Summary statistics report:
   - 1 rows (100% of original 1 rows)
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.425 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <int>   <int> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 -0.0453284 0.00988289 4.5063e-06
Returning data directly.

But now imagine that the effect and FRQ actually related to the major, reference Allele (A1) instead:

#effect and frequency relate to A1, the major/reference genome allele
ss <- data.table("SNP"="rs1182172","CHR"=7,"BP"=2838469,"A1"="G", "A2"="A", 
                 "FRQ"=0.638431,"BETA"=-0.0453284, "SE"=0.00988289, 
                 "P"=4.5063e-06)

Now if we run MSS we get:

> formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c0734e4726.tsv.gz
Infer Effect Column
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Allele columns are ambiguous, attempting to infer direction
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 4 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Effect/frq column(s) relate to A2 in the inputted sumstats
Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  A1  A2  FRQ BETA    SE  P   
Summary statistics report:
   - 1 rows
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 4 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Inferred genome build: GRCH38
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Coercing BP column to numeric.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
1 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c0734e4726.tsv.gz
Summary statistics report:
   - 1 rows (100% of original 1 rows)
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.442 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
         SNP   CHR      BP     A1     A2      FRQ       BETA         SE          P
      <char> <int>   <int> <char> <char>    <num>      <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.638431 -0.0453284 0.00988289 4.5063e-06
Returning data directly.

Specifically note the message:

SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.

So MSS assumes the effect measurement and FRQ relate to A2 as how could it know it was A1? It warns the user that the frequency looks weird but leaves it up to the user to address this and rerun if necessary. However if we rename A1 and A2 as effect and non-effect, which MSS can more easily interpret since they aren't as ambiguous as A1/A2 look what happens:

> formatted_ss <- MungeSumstats::format_sumstats(ss,return_data = TRUE)

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c03520b7.tsv.gz
Infer Effect Column
First line of summary statistics file: 
SNP CHR BP  effect_allele   non_effect_allele   FRQ BETA    SE  P   
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  effect_allele   non_effect_allele   FRQ BETA    SE  P   
Summary statistics report:
   - 1 rows
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 4 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Inferred genome build: GRCH38
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Checking for incorrect base-pair positions
Coercing BP column to numeric.
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 3 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 1 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Checking for bi-allelic SNPs.
Sorting coordinates with 'data.table'.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpM3XzqD/file4c03520b7.tsv.gz
Summary statistics report:
   - 1 rows (100% of original 1 rows)
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.24 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
         SNP   CHR      BP     A1     A2      FRQ      BETA         SE          P
      <char> <int>   <int> <char> <char>    <num>     <num>      <num>      <num>
1: rs1182172     7 2838469      G      A 0.361569 0.0453284 0.00988289 4.5063e-06
Returning data directly.

Specifically, note:

Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 1 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.

So here,. it flipped the effect and FRQ so they relate to the minor allele (A2) to standardise it.

Testing like this should give you an intuition of the checks MSS is doing and how it will work for your case and what you would need to do with your data.

Alan.