Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

MungeSumstats appears to be flipping alleles (and with it the `beta` and `FRQ`'s) when it shouldn't #138

Closed jksull closed 1 year ago

jksull commented 1 year ago

Firstly I would just like to thank you for you continued efforts in helping us all deal with the nightmare of non-standardised GWAS summary statistics. I've recently come across this tool and I believe it has the potential to make my (and many others) work enormously more efficient.

My tool versions are:

1. Bug description

Using the following raw summary statistics dataset, the MSS flips the allele frequency (and beta) when it doesn't seem like it should:

Before munging

variant minor_allele    minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval
1:15791:C:T T   5.42862e-09 3.91078e-05 true    361194  3.92157e-03 0.00000e+00 -3.56322e+00    2.53027e+01 -1.40823e-01    8.88009e-01
1:69487:G:A A   5.74891e-06 4.14151e-02 true    361194  4.15294e+00 0.00000e+00 -1.13132e-02    4.95620e-02 -2.28264e-01    8.19441e-01
1:69569:T:C C   1.87738e-04 1.35246e+00 true    361194  1.35620e+02 9.72549e-01 -2.70834e-03    8.88871e-03 -3.04695e-01    7.60599e-01
1:139853:C:T    T   5.66205e-06 4.07894e-02 true    361194  4.09020e+00 0.00000e+00 -1.11344e-02    4.95635e-02 -2.24648e-01    8.22253e-01
1:692794:CA:C   C   1.10640e-01 7.97048e+02 false   361194  7.99247e+04 7.74761e+02 -3.29738e-04    4.10525e-04 -8.03211e-01    4.21853e-01
1:693731:A:G    G   1.15830e-01 8.34441e+02 false   361194  8.36744e+04 8.05114e+02 -3.90192e-04    3.87896e-04 -1.00592e+00    3.14455e-01
1:707522:G:C    C   9.73034e-02 7.00973e+02 false   361194  7.02908e+04 6.59459e+02 -7.21990e-04    4.36074e-04 -1.65566e+00    9.77915e-02
1:717587:G:A    A   1.56880e-02 1.13017e+02 false   361194  1.13329e+04 1.16090e+02 3.13409e-04 1.04059e-03 3.01185e-01 7.63274e-01
1:723329:A:T    T   1.73363e-03 1.24891e+01 true    361194  1.25235e+03 8.50980e+00 -3.47324e-03    3.07012e-03 -1.13130e+00    2.57929e-01

After munging (Default parameters, dbSNP=144, ref_genome = "GRCh37"):

SNP CHR BP  A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
rs547522712 1   15791   C   T   0.99999999457138    0.0000391078    TRUE    361194  0.00392157  0   3.56322 25.3027 -0.140823   0.888009
rs568226429 1   69487   G   A   0.99999425109   0.0414151   TRUE    361194  4.15294 0   0.0113132   0.049562    -0.228264   0.819441
rs2531267   1   69569   T   C   0.999812262 1.35246 TRUE    361194  135.62  0.972549    0.00270834  0.00888871  -0.304695   0.760599
rs533633326 1   139853  C   T   0.99999433795   0.0407894   TRUE    361194  4.0902  0   0.0111344   0.0495635   -0.224648   0.822253
rs12238997  1   693731  A   G   0.88417 834.441 FALSE   361194  83674.4 805.114 0.000390192 0.000387896 -1.00592    0.314455
rs371890604 1   707522  G   C   0.9026966   700.973 FALSE   361194  70290.8 659.459 0.00072199  0.000436074 -1.65566    0.0977915
rs144155419 1   717587  G   A   0.984312    113.017 FALSE   361194  11332.9 116.09  -0.000313409    0.00104059  0.301185    0.763274
rs189787166 1   723329  A   T   0.99826637  12.4891 TRUE    361194  1252.35 8.5098  0.00347324  0.00307012  -1.1313 0.257929

As you can see, the FRQ column has flipped, so has the sign of the beta. The reason that they are flipped (logically-speaking) is because MSS thinks that There are 8 SNPs where A1 doesn't match the reference genome. I understand that when the SNPs don't match the reference, they may be flipped together with the FRQ and beta, but in this example it is clear that the allele's themselves don't appear to have been flipped (at least according to the resulting data tables) - nor should they be!

See the log here:

Formatted summary statistics will be saved to ==>  /Users/<>/Documents/testMunger/mungerN20.munged.tsv
Importing tabular file: /Users/<>/Documents/testMunger/mungerN20.tsv
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
variant minor_allele    minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval    
Summary statistics report:
   - 9 rows
   - 9 unique variants
   - 0 genome-wide significant variants (P<5e-8)
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A2 is uppercase
Summary statistics file does not have obvious CHR/BP columns. Checking to see if they are joined in another column.
Column ID has been separated into the columns CHR, BP, A2, A1
Standardising column headers.
First line of summary statistics file: 
A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P   CHR BP  A1  
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 8 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 11 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 8 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
8 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, 
A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, 
EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, 
FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /Users/<>/Documents/testMunger/mungerN20.munged.tsv
Summary statistics report:
   - 8 rows (88.9% of original 9 rows)
   - 8 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.212 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2        FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT      N
1: rs547522712   1  15791  C  T 0.99999999           0.0000391078                   TRUE 361194
2: rs568226429   1  69487  G  A 0.99999425           0.0414151000                   TRUE 361194
3:   rs2531267   1  69569  T  C 0.99981226           1.3524600000                   TRUE 361194
4: rs533633326   1 139853  C  T 0.99999434           0.0407894000                   TRUE 361194
             AC      YTX       BETA          SE     TSTAT        P
1:   0.00392157 0.000000 3.56322000 25.30270000 -0.140823 0.888009
2:   4.15294000 0.000000 0.01131320  0.04956200 -0.228264 0.819441
3: 135.62000000 0.972549 0.00270834  0.00888871 -0.304695 0.760599
4:   4.09020000 0.000000 0.01113440  0.04956350 -0.224648 0.822253
Returning path to saved data.

I have confirmed that the minor_allele in the original dataset is in fact the minor allele, or A2 as MSS would see it. This was confirmed through manual search in dbSNP as well as can be inferred from the variant column (i.e. 1:15791:C:T in the first row relates to the minor_allele T, with the major allele expected to be assigned as C by MSS).

I tried the following additional tests to see if I could enforce the expected behaviour

I will paste only the first three SNPs of each of the following tables (both before and after munging), to keep things as concise as possible

1. Standardised the header prior to munging using standardise_header

Before munging

SNP A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
1:15791:C:T T   5.42862e-9  3.91078e-5  TRUE    361194  0.00392157  0   -3.56322    25.3027 -0.140823   0.888009
1:69487:G:A A   5.74891e-6  0.0414151   TRUE    361194  4.15294 0   -0.0113132  0.049562    -0.228264   0.819441
1:69569:T:C C   1.87738e-4  1.35246 TRUE    361194  135.62  0.972549    -0.00270834 0.00888871  -0.304695   0.760599

After munging

SNP CHR BP  A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
rs547522712 1   15791   C   T   0.99999999457138    0.0000391078    TRUE    361194  0.00392157  0   3.56322 25.3027 -0.140823   0.888009
rs568226429 1   69487   G   A   0.99999425109   0.0414151   TRUE    361194  4.15294 0   0.0113132   0.049562    -0.228264   0.819441
rs2531267   1   69569   T   C   0.999812262 1.35246 TRUE    361194  135.62  0.972549    0.00270834  0.00888871  -0.304695   0.760599

2. Manually changed the name of minor_allele column to A2, prior to munging

Before munging

variant A2  minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval
1:15791:C:T T   5.42862e-09 3.91078e-05 true    361194  3.92157e-03 0.00000e+00 -3.56322e+00    2.53027e+01 -1.40823e-01    8.88009e-01
1:69487:G:A A   5.74891e-06 4.14151e-02 true    361194  4.15294e+00 0.00000e+00 -1.13132e-02    4.95620e-02 -2.28264e-01    8.19441e-01
1:69569:T:C C   1.87738e-04 1.35246e+00 true    361194  1.35620e+02 9.72549e-01 -2.70834e-03    8.88871e-03 -3.04695e-01    7.60599e-01

After munging

SNP CHR BP  A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
rs547522712 1   15791   C   T   0.99999999457138    0.0000391078    TRUE    361194  0.00392157  0   3.56322 25.3027 -0.140823   0.888009
rs568226429 1   69487   G   A   0.99999425109   0.0414151   TRUE    361194  4.15294 0   0.0113132   0.049562    -0.228264   0.819441
rs2531267   1   69569   T   C   0.999812262 1.35246 TRUE    361194  135.62  0.972549    0.00270834  0.00888871  -0.304695   0.760599

3. Manually changed the name of minor_allele column to A2 AND manually added a A1 column with the corresponding correct reference allele

Before munging

variant A1  A2  minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval
1:15791:C:T C   T   5.42862e-09 3.91078e-05 true    361194  3.92157e-03 0.00000e+00 -3.56322e+00    2.53027e+01 -1.40823e-01    8.88009e-01
1:69487:G:A G   A   5.74891e-06 4.14151e-02 true    361194  4.15294e+00 0.00000e+00 -1.13132e-02    4.95620e-02 -2.28264e-01    8.19441e-01
1:69569:T:C T   C   1.87738e-04 1.35246e+00 true    361194  1.35620e+02 9.72549e-01 -2.70834e-03    8.88871e-03 -3.04695e-01    7.60599e-01

After munging

SNP CHR BP  A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
rs547522712 1   15791   C   T   0.99999999457138    0.0000391078    TRUE    361194  0.00392157  0   3.56322 25.3027 -0.140823   0.888009
rs568226429 1   69487   G   A   0.99999425109   0.0414151   TRUE    361194  4.15294 0   0.0113132   0.049562    -0.228264   0.819441
rs2531267   1   69569   T   C   0.999812262 1.35246 TRUE    361194  135.62  0.972549    0.00270834  0.00888871  -0.304695   0.760599

The log from this final test still explicitly states that There are 8 SNPs where A1 doesn't match the reference genome.. See full log below:

Formatted summary statistics will be saved to ==>  /Users/<>/Documents/testMunger/mungerN20.A1A2.munged.tsv
Importing tabular file: /Users/<>/Documents/testMunger/mungerN20.A1A2.tsv
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
variant A1  A2  minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval    
Summary statistics report:
   - 9 rows
   - 9 unique variants
   - 0 genome-wide significant variants (P<5e-8)
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Summary statistics file does not have obvious CHR/BP columns. Checking to see if they are joined in another column.
Column ID has been separated into the columns CHR, BP, A2, A1
Standardising column headers.
First line of summary statistics file: 
A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P   CHR BP  
Loading SNPlocs data.
There is no SNP column found within the data. It must be inferred from other column information.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 8 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 11 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 8 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
8 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, 
A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, 
EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, 
FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /Users/<>/Documents/testMunger/mungerN20.A1A2.munged.tsv
Summary statistics report:
   - 8 rows (88.9% of original 9 rows)
   - 8 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 0.204 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2        FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT      N
1: rs547522712   1  15791  C  T 0.99999999           0.0000391078                   TRUE 361194
2: rs568226429   1  69487  G  A 0.99999425           0.0414151000                   TRUE 361194
3:   rs2531267   1  69569  T  C 0.99981226           1.3524600000                   TRUE 361194
4: rs533633326   1 139853  C  T 0.99999434           0.0407894000                   TRUE 361194
             AC      YTX       BETA          SE     TSTAT        P
1:   0.00392157 0.000000 3.56322000 25.30270000 -0.140823 0.888009
2:   4.15294000 0.000000 0.01131320  0.04956200 -0.228264 0.819441
3: 135.62000000 0.972549 0.00270834  0.00888871 -0.304695 0.760599
4:   4.09020000 0.000000 0.01113440  0.04956350 -0.224648 0.822253
Returning path to saved data.

I apologise if I'm doing something wrong, but I tried to be extensive with my testing and exhaustive in my research into the matter within your docs!

3. Session info

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2       stringr_1.5.0       dplyr_1.0.10        purrr_0.3.5        
 [5] readr_2.1.3         tidyr_1.2.1         tibble_3.1.8        ggplot2_3.4.0      
 [9] tidyverse_1.3.2     GenomeInfoDb_1.34.4 IRanges_2.32.0      S4Vectors_0.36.1   
[13] BiocGenerics_0.44.0 MungeSumstats_1.6.0

loaded via a namespace (and not attached):
  [1] bitops_1.0-7                                matrixStats_0.63.0                         
  [3] fs_1.5.2                                    lubridate_1.9.0                            
  [5] bit64_4.0.5                                 filelock_1.0.2                             
  [7] progress_1.2.2                              httr_1.4.4                                 
  [9] googleAuthR_2.0.0                           GenomicFiles_1.34.0                        
 [11] tools_4.2.2                                 backports_1.4.1                            
 [13] utf8_1.2.2                                  R6_2.5.1                                   
 [15] colorspace_2.0-3                            DBI_1.1.3                                  
 [17] withr_2.5.0                                 tidyselect_1.2.0                           
 [19] prettyunits_1.1.1                           bit_4.0.5                                  
 [21] curl_4.3.3                                  compiler_4.2.2                             
 [23] rvest_1.0.3                                 cli_3.4.1                                  
 [25] Biobase_2.58.0                              xml2_1.3.3                                 
 [27] DelayedArray_0.24.0                         rtracklayer_1.58.0                         
 [29] scales_1.2.1                                rappdirs_0.3.3                             
 [31] digest_0.6.31                               Rsamtools_2.14.0                           
 [33] R.utils_2.12.2                              XVector_0.38.0                             
 [35] pkgconfig_2.0.3                             BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
 [37] MatrixGenerics_1.10.0                       dbplyr_2.2.1                               
 [39] fastmap_1.1.0                               BSgenome_1.66.1                            
 [41] readxl_1.4.1                                rlang_1.0.6                                
 [43] rstudioapi_0.14                             RSQLite_2.2.19                             
 [45] BiocIO_1.8.0                                generics_0.1.3                             
 [47] jsonlite_1.8.4                              vroom_1.6.0                                
 [49] BiocParallel_1.32.4                         googlesheets4_1.0.1                        
 [51] R.oo_1.25.0                                 VariantAnnotation_1.44.0                   
 [53] RCurl_1.98-1.9                              magrittr_2.0.3                             
 [55] GenomeInfoDbData_1.2.9                      Matrix_1.5-3                               
 [57] munsell_0.5.0                               Rcpp_1.0.9                                 
 [59] fansi_1.0.3                                 lifecycle_1.0.3                            
 [61] R.methodsS3_1.8.2                           stringi_1.7.8                              
 [63] yaml_2.3.6                                  SummarizedExperiment_1.28.0                
 [65] zlibbioc_1.44.0                             BiocFileCache_2.6.0                        
 [67] grid_4.2.2                                  blob_1.2.3                                 
 [69] parallel_4.2.2                              crayon_1.5.2                               
 [71] lattice_0.20-45                             haven_2.5.1                                
 [73] Biostrings_2.66.0                           GenomicFeatures_1.50.3                     
 [75] hms_1.1.2                                   KEGGREST_1.38.0                            
 [77] pillar_1.8.1                                GenomicRanges_1.50.1                       
 [79] rjson_0.2.21                                BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000     
 [81] codetools_0.2-18                            biomaRt_2.54.0                             
 [83] reprex_2.0.2                                XML_3.99-0.13                              
 [85] glue_1.6.2                                  modelr_0.1.10                              
 [87] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20    data.table_1.14.6                          
 [89] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20    tzdb_0.3.0                                 
 [91] png_0.1-8                                   vctrs_0.5.1                                
 [93] cellranger_1.1.0                            gtable_0.3.1                               
 [95] assertthat_0.2.1                            cachem_1.0.6                               
 [97] broom_1.0.2                                 restfulr_0.0.15                            
 [99] googledrive_2.0.0                           gargle_1.2.1                               
[101] GenomicAlignments_1.34.0                    AnnotationDbi_1.60.0                       
[103] memoise_2.0.1                               timechange_0.1.1                           
[105] ellipsis_0.3.2 
jksull commented 1 year ago

For good measure, I set the allele_flip_check flag to FALSE:

Before munging

variant minor_allele    minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval
1:15791:C:T T   5.42862e-09 3.91078e-05 true    361194  3.92157e-03 0.00000e+00 -3.56322e+00    2.53027e+01 -1.40823e-01    8.88009e-01
1:69487:G:A A   5.74891e-06 4.14151e-02 true    361194  4.15294e+00 0.00000e+00 -1.13132e-02    4.95620e-02 -2.28264e-01    8.19441e-01
1:69569:T:C C   1.87738e-04 1.35246e+00 true    361194  1.35620e+02 9.72549e-01 -2.70834e-03    8.88871e-03 -3.04695e-01    7.60599e-01

After munging

SNP CHR BP  A1  A2  FRQ EXPECTED_CASE_MINOR_AC  LOW_CONFIDENCE_VARIANT  N   AC  YTX BETA    SE  TSTAT   P
rs547522712 1   15791   T   C   0.00000000542862    0.0000391078    TRUE    361194  0.00392157  0   -3.56322    25.3027 -0.140823   0.888009
rs568226429 1   69487   A   G   0.00000574891   0.0414151   TRUE    361194  4.15294 0   -0.0113132  0.049562    -0.228264   0.819441
rs2531267   1   69569   C   T   0.000187738 1.35246 TRUE    361194  135.62  0.972549    -0.00270834 0.00888871  -0.304695   0.760599

You can see that the FRQ and beta are no longer flipped, but the A1 allele and A2 allele are incorrectly assigned (i.e. they have been reversed). It seems like with this dataset, MSS cannot properly recognise the columns?

jksull commented 1 year ago

Apologies for the added comments, but I also thought it's imperative to mention that this dataset is taken directly from the commonly used Neale Lab GWAS summary statistics, run on UK Biobank data. You can download the full file using this command:

wget "https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/N20.gwas.imputed_v3.both_sexes.tsv.bgz" -O N20.gwas.imputed_v3.both_sexes.tsv.bgz
Al-Murphy commented 1 year ago

Hey!

Great to hear you find the package useful. Thank you for the very detailed bug description, it made things quite easy to investigate.

So the behaviour you are noting isn't necessarily a bug per se. I set up MSS so that if a column with 1:69569:T:C was detected it would be taken as CHR:BP:A2:A1. This is what this printed message meant:

Column ID has been separated into the columns CHR, BP, A2, A1

As you can imagine there is no way to know whether the inputted data is CHR:BP:A2:A1 or CHR:BP:A1:A2 so I added the message to at least explain the assumption.

Another solution which is already in MSS is to rename the column, giving the order. See below, I just copied the 9 rows in your example to a csv:

#if you rename variant it will work as expected
samp_ss <- data.table::fread("~/Downloads/samp_dat_ss.csv")
samp_ss[,MAF_ORG:=minor_AF]
data.table::setnames(samp_ss,"variant","CHR:BP:A1:A2")
a2 <- MungeSumstats::format_sumstats(samp_ss,
                                    dbSNP = 144,
                                    ref_genome = "GRCh37")
munged2 <- data.table::fread(a2)
munged2
           SNP CHR     BP A1 A2      FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT      N       AC     YTX      BETA
1: rs547522712   1  15791  C  T 5.43e-09               3.91e-05                   TRUE 361194 3.92e-03   0.000 -3.560000
2: rs568226429   1  69487  G  A 5.75e-06               4.14e-02                   TRUE 361194 4.15e+00   0.000 -0.011300
3:   rs2531267   1  69569  T  C 1.88e-04               1.35e+00                   TRUE 361194 1.36e+02   0.973 -0.002710
4: rs533633326   1 139853  C  T 5.66e-06               4.08e-02                   TRUE 361194 4.09e+00   0.000 -0.011100
5:  rs12238997   1 693731  A  G 1.16e-01               8.34e+02                  FALSE 361194 8.37e+04 805.000 -0.000390
6: rs371890604   1 707522  G  C 9.73e-02               7.01e+02                  FALSE 361194 7.03e+04 659.000 -0.000722
7: rs144155419   1 717587  G  A 1.57e-02               1.13e+02                  FALSE 361194 1.13e+04 116.000  0.000313
8: rs189787166   1 723329  A  T 1.73e-03               1.25e+01                   TRUE 361194 1.25e+03   8.510 -0.003470
         SE  TSTAT      P  MAF_ORG
1: 2.53e+01 -0.141 0.8880 5.43e-09
2: 4.96e-02 -0.228 0.8190 5.75e-06
3: 8.89e-03 -0.305 0.7610 1.88e-04
4: 4.96e-02 -0.225 0.8220 5.66e-06
5: 3.88e-04 -1.010 0.3140 1.16e-01
6: 4.36e-04 -1.660 0.0978 9.73e-02
7: 1.04e-03  0.301 0.7630 1.57e-02
8: 3.07e-03 -1.130 0.2580 1.73e-03

However, since you also have a separate minor allele (A2) column supplied, I think MSS could be a little smarter and use this to infer the order of CHR:BP:A2:A1 or CHR:BP:A1:A2. This has been implemented in the dev version of MSS (1.7.11) which you can use immediately by installing directly from Github. This update results in the expected behaviour:

samp_ss <- data.table::fread("~/Downloads/samp_dat_ss.csv")
samp_ss[,MAF_ORG:=minor_AF]
a <- MungeSumstats::format_sumstats(samp_ss,
                               dbSNP = 144,
                               ref_genome = "GRCh37")
munged <- data.table::fread(a)
munged
           SNP CHR     BP A1 A2      FRQ EXPECTED_CASE_MINOR_AC LOW_CONFIDENCE_VARIANT      N       AC     YTX      BETA
1: rs547522712   1  15791  C  T 5.43e-09               3.91e-05                   TRUE 361194 3.92e-03   0.000 -3.560000
2: rs568226429   1  69487  G  A 5.75e-06               4.14e-02                   TRUE 361194 4.15e+00   0.000 -0.011300
3:   rs2531267   1  69569  T  C 1.88e-04               1.35e+00                   TRUE 361194 1.36e+02   0.973 -0.002710
4: rs533633326   1 139853  C  T 5.66e-06               4.08e-02                   TRUE 361194 4.09e+00   0.000 -0.011100
5:  rs12238997   1 693731  A  G 1.16e-01               8.34e+02                  FALSE 361194 8.37e+04 805.000 -0.000390
6: rs371890604   1 707522  G  C 9.73e-02               7.01e+02                  FALSE 361194 7.03e+04 659.000 -0.000722
7: rs144155419   1 717587  G  A 1.57e-02               1.13e+02                  FALSE 361194 1.13e+04 116.000  0.000313
8: rs189787166   1 723329  A  T 1.73e-03               1.25e+01                   TRUE 361194 1.25e+03   8.510 -0.003470
         SE  TSTAT      P  MAF_ORG
1: 2.53e+01 -0.141 0.8880 5.43e-09
2: 4.96e-02 -0.228 0.8190 5.75e-06
3: 8.89e-03 -0.305 0.7610 1.88e-04
4: 4.96e-02 -0.225 0.8220 5.66e-06
5: 3.88e-04 -1.010 0.3140 1.16e-01
6: 4.36e-04 -1.660 0.0978 9.73e-02
7: 1.04e-03  0.301 0.7630 1.57e-02
8: 3.07e-03 -1.130 0.2580 1.73e-03
jksull commented 1 year ago

Thanks for the quick response! I can see what you mean. I did think that if nothing else, we could expect the format CHR:BP:A1:A2 to be the standard, as it is in line with the SPDI Notation (mostly), but if I've learnt anything recently it's not to trust GWAS Summary statistics..ever!

Al-Murphy commented 1 year ago

Hey! Fair comment, I've updated so the default is CHR:BP:A1:A2 too (v 1.7.12). Let me know if this isn't working as expected

jksull commented 1 year ago

Thanks a lot! I think that makes sense, and will let you know if I see any issues :)

jksull commented 1 year ago

Hey @Al-Murphy , just checking that the behaviour you describe in your first response as part of 1.7.11:

"However, since you also have a separate minor allele (A2) column supplied, I think MSS could be a little smarter and use this to infer the order of CHR:BP:A2:A1 or CHR:BP:A1:A2."

Is still the expected behaviour in the 1.7.12? And additionally, does it infer on a variant-by-variant (i.e. row-by-row) manner?

It appears that SNPs within the same summary stats file can be in either order:

e.g.

variant   minor_allele minor_AF
1:69569:T:C  C  0.0156927
1:753405:C:A  C  0.12976

Which would make this imputation (if it is applied on the entire df) incorrect for some SNPs, with little way of being certain.

The problem I am envisioning if the order is inferred for the entire dataframe, is that the second SNP in the example above will flip the minor_allele to become the A1, despite it being explicit that the T is the minor allele, if I understand correctly?

Al-Murphy commented 1 year ago

Hey! Just to clarify, the use of an additional column for A1/A2 that I described as being implemented in 1.7.11 is still there in 1.7.12. The check works by simply looking at the first row in the sumstats and checking the A2 column there to find a match. I suppose it might be better to sample more rows and do a majority vote (feel free to make a pull request with this if you think it's necessary for your use case). This check is however overwritten though if the column is named as something like CHR:BP:A1:A2.

The example you described would be very worrying if it is the case. If there were row differences where the A1 and A2 values order change but their MAF and Beta columns aren't flipped along with these, I don't see how it could ever be accounted for? Think about the example where we have:

SNP  CHR  BP  A1  A2  BETA  FRQ
rs0001  1  15791  C  T 0.013 5.43e-09
rs0002  1  69487  G  A -0.018  5e-06

If however, rs0001 had A1 = major allele and rs0002 had A1 = minor allele, we would need to assume that the BETA and FRQ values are not in the same direction i.e. rs0001 FRQ describes the minor allele frequency or MAF and for rs0002 it is the major allele frequency or vice versa. Then we can flip the alleles for one of these and also flip the FRQ and BETA values. This idea of consistent direction of these effect and frequency columns is paramount.

jksull commented 1 year ago

"This idea of consistent direction of these effect and frequency columns is paramount."

I completely agree, but I'm afraid this doesn't seem to be the case.

I have to admit, I'm a little confused. But allow me to provide some further context.

In the above example (using MSS 1.7.8), the resulting munged stats looks something like this:

SNP  CHR      BP   A1   A2    FREQ
rs25131267 1    69569    T    C     0.0156927
rs3115860  1     753405    C    A     0.1297

As you can see, the minor_allele for the second SNP is inferred as the A1 allele (but notice how the FREQ hasn't been flipped).

So, based on your response above, 1.7.11 won't ignore the variant column in favour of the minor_allele, on a SNP-by-SNP basis, but instead only on the first row (and therefore if this inconsistent direction is present, the resulting munged stats won't be correct)?

jksull commented 1 year ago

I can think of two solutions to the problem

  1. Check the order of the SNP column based on the minor allele column on a SNP-by-SNP basis. If the minor_allele is actually A1, then flip the beta and the FREQ accordingly (if these flags are set to TRUE).
  2. Have option to ignore the SNP column entirely, firstly allowing MSS to impute the rsID based on CHR and BP (I believe this should be possible as it is?), and also perform the allele check to decide whether the minor_allele is actually the A1 allele (after finding it on the REF), and flip the other values accordingly (again, of course only if these are set to TRUE).

As I understand, the latter part of option 2 (i.e. _perform the allele check to decide whether the minor_allele is actually the A1 allele, and flip the other values accordingly._) is already part of MSS, but it would be necessary that it does not infer the order based on just the first row, but rather to perform the rsID imputation (based on CHR and BP) separately to deciding whether the minor_allele matches the REF (i.e. is A1).

There are of course underlying knowledge of the specific dataset that must be understood regarding whether we want to flip the BETA as well (some datasets provide a minor_allele, a minor_AF, but state explicitly that the minor allele can be the REF allele, and yet they provide the BETA relative to the effect allele - not relative to the minor_allele).

jksull commented 1 year ago

And regarding this:

"If however, rs0001 had A1 = major allele and rs0002 had A1 = minor allele, we would need to assume that the BETA and FRQ values are not in the same direction i.e. rs0001 FRQ describes the minor allele frequency or MAF and for rs0002 it is the major allele frequency or vice versa"

This is exactly the issue, and a seemingly a common format of summary statistics. If you look at the following column description in the very commonly used Neale Lab data, you can see this is the case :(

image

and

image

So the resulting summary statistics from which this thread was started:

variant minor_allele    minor_AF    expected_case_minor_AC  low_confidence_variant  n_complete_samples  AC  ytx beta    se  tstat   pval
1:15791:C:T T   5.42862e-09 3.91078e-05 true    361194  3.92157e-03 0.00000e+00 -3.56322e+00    2.53027e+01 -1.40823e-01    8.88009e-01
1:69487:G:A A   5.74891e-06 4.14151e-02 true    361194  4.15294e+00 0.00000e+00 -1.13132e-02    4.95620e-02 -2.28264e-01    8.19441e-01
1:69569:T:C C   1.87738e-04 1.35246e+00 true    361194  1.35620e+02 9.72549e-01 -2.70834e-03    8.88871e-03 -3.04695e-01    7.60599e-01
1:139853:C:T    T   5.66205e-06 4.07894e-02 true    361194  4.09020e+00 0.00000e+00 -1.11344e-02    4.95635e-02 -2.24648e-01    8.22253e-01
1:692794:CA:C   C   1.10640e-01 7.97048e+02 false   361194  7.99247e+04 7.74761e+02 -3.29738e-04    4.10525e-04 -8.03211e-01    4.21853e-01
1:693731:A:G    G   1.15830e-01 8.34441e+02 false   361194  8.36744e+04 8.05114e+02 -3.90192e-04    3.87896e-04 -1.00592e+00    3.14455e-01
1:707522:G:C    C   9.73034e-02 7.00973e+02 false   361194  7.02908e+04 6.59459e+02 -7.21990e-04    4.36074e-04 -1.65566e+00    9.77915e-02
1:717587:G:A    A   1.56880e-02 1.13017e+02 false   361194  1.13329e+04 1.16090e+02 3.13409e-04 1.04059e-03 3.01185e-01 7.63274e-01
1:723329:A:T    T   1.73363e-03 1.24891e+01 true    361194  1.25235e+03 8.50980e+00 -3.47324e-03    3.07012e-03 -1.13130e+00    2.57929e-01

...has both directions of alleles, and the minor_AF is relative only to the minor_allele, where the "alt allele is not always minor.

It's important to not that the BETA is relative to the effect allele, which is also not necessarily the minor_allele... (my head hurts).

Regarding how the minor_AF is calculated based on the AF in the original variant annotation file )thevariants.tsv.bgz` (first photo), they explain it in FAQ:

image

And for the GWAS statistics themselves:

image


Back to MSS, assuming the variant order of CHR:BP:A1:A2, perhaps this could just be solved by checking whether the minor_allele corresponds to the A1 (and if so make sure that the FREQ > 0.5), and likewise if the minor_allele corresponds to the A2, then the FREQ must be < 0.5. To do this there must be a distinction that minor_allele does not necessarily map to A2 - as evident by the dataset I have linked.

Al-Murphy commented 1 year ago

Hey! A few things in this that are worth pointing out:

Check the order of the SNP column based on the minor allele column on a SNP-by-SNP basis. If the minor_allele is actually A1, then flip the beta and the FREQ accordingly (if these flags are set to TRUE).

I don't think this is the best idea because what we actually want is to identify the alternative allele not the minor allele in MSS as this is the allele that the effect columns relate to i.e. the allele that is being tested. This may not be the minor allele in all instances (as you show). I'll explain the best practice on this going forward later.

Have option to ignore the SNP column entirely, firstly allowing MSS to impute the rsID based on CHR and BP (I believe this should be possible as it is?), and also perform the allele check to decide whether the minor_allele is actually the A1 allele (after finding it on the REF), and flip the other values accordingly (again, of course only if these are set to TRUE). This is possible since an RS ID relates to the combination of BP and CHR. So I mean that if the allele is non-bi-allelic, both alternative alleles will have the same RS ID. However, this would cause the same issue as above - we want the alternative allele not the minor allele in MSS as this is the allele that the effect columns relate to. Getting A1/A2 from a reference dataset would not guarantee this (plus if you have any non-bi-allelic alleles, these would cause issues too).

Once we have the alternative allele, MSS will then flip the necessary SNPs (in a SNP by SNP basis) to ensure A1 is the reference or major allele and A2 is the minor. It will also flip the FRQ and effect columns. It is worth noting that, in this way, MSS will only flip the effects in the correct columns.

Looking at the screen grabs of the sumstats in question, I think there is a simple solution to ensure we identify the alternative allele. Essentially, we want MSS to pick up AF as the FRQ value since this relates to the FRQ of the alternative allele studied (whether that be the minor or not). The beta (i.e. the effect column) is already relating to the alternative allele. So as long as MSS picks up this, the SNP flipping will work correctly. MSS already has this as a value for FRQ (see data("sumstatsColHeaders") for all the mappings). And if you pass MSS a sumstats with AF and minor_allele it will take AF as the FRQ value (i.e. the correct one). See below:

a <- MungeSumstats::formatted_example()
data.table::setnames(a,'FRQ','AF')
a[,minor_allele:=AF]
b <- MungeSumstats::format_sumstats(a)
data.table::fread(b)
> data.table::fread(b)
            SNP CHR        BP A1 A2     FRQ   BETA    SE         P MINOR_ALLELE
 1:  rs34305371   1  72733610  G  A 0.91231 -0.035 0.005 3.762e-14      0.08769
 2:   rs1008078   1  91189731  C  T 0.62690  0.016 0.003 6.005e-10      0.37310
 3:  rs61787263   1  98618714  T  C 0.76120  0.016 0.003 5.391e-08      0.76120
 4:   rs2992632   1 243503764  A  T 0.66040  0.017 0.003 8.227e-09      0.66040

Note even if this wasn't the case though you can change the sumstatsColHeaders reference as you need. Hope this makes sense? I agree the whole sumstats standardisation does start to melt your head (really this is what drove me to make MSS).

Also, I'm reopening this issue, not because I plan to work on any alternative functionality based on this but more so, so others are aware of it and we can come back to it at a the future stage!

Al-Murphy commented 1 year ago

I believe the subsequent discussions on related issues covered what's going on here so I'm going to close for now.

Cheers, Alan.