Accidentally assigning AF as INFO

bschilder commented 2 years ago

1. Bug description

I've applied these fixes to MungeSumstats v1.3.20

AF (allele frequency) was accidentally being assigned as INFO column in VCFs where the INFO rows started with "AF". This caused a large number of SNPs to be incorrectly dropped during the check_info_score step.
If INFO score is not available, INFO column is now dropped entirely (rather than assigning all 1s).
- Adjusted test-vcf_formatting to reflect this. This avoids ambiguity about whether the INFO score is real or not.
check_info_score:
- Added extra messages in various conditions where INFO is not used for filtering, and don't add log_files$info_filter in these instances.
- Added unit tests.

Expected behaviour

Don't apply INFO score filtering when the INFO col actually contains allele frequency (AF).

2. Reproducible example

Code

 res <- import_sumstats(ids = "ubm-a-2929", nThread = 10)

Console output: MungeSumstats 1.3.19

% SNPs kept by the end was very low:

221,499 rows (1.9% of original 11,734,353 rows)

Reading header.
Importing VCF file: /tmp/Rtmp9baNSX/ubm-a-2929.vcf.gz
Reading VCF file.
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ubm-a-2929  
Removing non-standard columns: QUAL, FILTER, FORMAT
Parsing 'UBM-A-2929' data column.
0 empty column(s) detected.
VCF file has -log10 P-values, these will be converted to unadjusted p-values in the 'P' column.
Formatting INFO column.
INFO column is actually AF, it will be converted.
Standardising column headers.
First line of summary statistics file: 
CHR BP  SNP A1  A2  INFO    ES  SE  LP  AF  ID  P   
Summary statistics report:
   - 11,734,353 rows
   - 11,719,908 unique variants
   - 132,289 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading reference genome data.
Loading reference genome data.
Inferred genome build: GRCH37
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
48,199 SNPs are not on the reference genome. These will be corrected from the reference genome.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/snp_not_found_from_chr_bp.tsv.gz
Loading reference genome data.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 12 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/alleles_dont_match_ref_gen.tsv.gz
There are 2 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
1,193,750 p-values are >1 which LDSC/MAGMA may not be able to handle. These will be converted to 1.
Checking for duplicate SNPs from SNP ID.
42,839 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/dup_snp_id.tsv.gz
Checking for SNPs with duplicated base-pair positions.
1 base-pair positions are duplicated in the sumstats file. These duplicates will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/dup_base_pair_position.tsv.gz
Filtering SNPs based on INFO score.
11,446,014 SNPs are below the INFO threshold of 0.9 and will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/info_filter.tsv.gz
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
8,194 SNPs are non-biallelic. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/logs/snp_bi_allelic.tsv.gz
Warning: When method is an integer, must be >0.
221,498 SNPs (100%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/ubm-a-2929/ubm-a-2929.tsv.gz
Summary statistics report:
   - 221,499 rows (1.9% of original 11,734,353 rows)
   - 221,187 unique variants
   - 2,631 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    INFO    BETA     SE        LP     FRQ         ID         P
1: rs12025928   1 546697  A  G 0.91442  0.0080 0.0368 1.0969100 0.91442 rs12025928 0.0800000
2:  rs2977670   1 723891  G  C 0.96456  0.0626 0.0423 0.0655015 0.96456  rs2977670 0.8600001
3:  rs2905042   1 767780  G  A 0.99884 -0.0005 0.2420       Inf 0.99884  rs2905042 0.0000000
4:  rs3095826   1 770918  T  G 0.99833  0.0263 0.2272 1.3979400 0.99833  rs3095826 0.0400000

Console output: MungeSumstats 1.3.20

After applying some fixes, this is the result. Drastically more SNPs are kept:

11,359,175 rows (96.8% of original 11,734,353 rows)

Processing 1 datasets from Open GWAS.

========== Processing dataset : ubm-a-2929 ==========

Downloading VCF ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpUaBMco/ubm-a-2929.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz'
Content type 'application/gzip' length 308589956 bytes (294.3 MB)
==================================================
downloaded 294.3 MB

Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz.tbi'
Content type 'application/gzip' length 1570415 bytes (1.5 MB)
==================================================
downloaded 1.5 MB

Formatted summary statistics will be saved to ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpUaBMco/ubm-a-2929/ubm-a-2929.tsv.gz
Reading header.
Importing VCF file: /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpUaBMco/ubm-a-2929.vcf.gz
Reading VCF file.
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ubm-a-2929  
Removing non-standard columns: QUAL, FILTER, FORMAT
Parsing 'UBM-A-2929' data column.
0 empty column(s) detected.
VCF file has -log10 P-values, these will be converted to unadjusted p-values in the 'P' column.
Formatting INFO column.
INFO column is actually AF, it will be converted.
Standardising column headers.
First line of summary statistics file: 
CHR BP  SNP A1  A2  ES  SE  LP  AF  ID  P   
Summary statistics report:
   - 11,734,353 rows
   - 11,719,908 unique variants
   - 132,289 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading reference genome data.
Loading reference genome data.
Inferred genome build: GRCH37
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
19,631 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading reference genome data.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 12 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 2 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
1,199,604 p-values are >1 which LDSC/MAGMA may not be able to handle. These will be converted to 1.
Checking for duplicate SNPs from SNP ID.
100,095 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
1 base-pair positions are duplicated in the sumstats file. These duplicates will be removed.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
414,778 SNPs are non-biallelic. These will be removed.
Warning: When method is an integer, must be >0.
1,435,557 SNPs (12.6%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpUaBMco/ubm-a-2929/ubm-a-2929.tsv.gz
Summary statistics report:                                                                                                           
   - 11,359,175 rows (96.8% of original 11,734,353 rows)
   - 11,352,589 unique variants
   - 128,148 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR    BP A1 A2    BETA     SE       LP       FRQ          ID         P
1:   rs2462492   1 54676  C  T -0.0354 0.0260 0.119186 0.3996650   rs2462492 0.7600007
2:   rs3107975   1 55326  T  C  0.1696 0.1320 0.154902 0.0089751   rs3107975 0.6999999
3:  rs74447903   1 57033  T  C  0.2794 0.3047 0.356547 0.0018298  rs74447903 0.4400003
4: rs114608975   1 86028  T  C  0.0425 0.0418 0.292430 0.1044800 rs114608975 0.5099998
Returning path to saved data.

ubm-a-2929 : Done in 23.66 minutes.

Done with all processing in 23.66 minutes.

3. Session info

``` R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] MungeSumstats_1.3.19 loaded via a namespace (and not attached): [1] fs_1.5.2 bitops_1.0-7 [3] matrixStats_0.61.0 xopen_1.0.0 [5] usethis_2.1.5 devtools_2.4.3 [7] bit64_4.0.5 filelock_1.0.2 [9] progress_1.2.2 httr_1.4.2 [11] rprojroot_2.0.3 GenomeInfoDb_1.30.1 [13] googleAuthR_2.0.0 tools_4.1.3 [15] utf8_1.2.2 R6_2.5.1 [17] DBI_1.1.2 BiocGenerics_0.40.0 [19] withr_2.5.0 tidyselect_1.1.2 [21] prettyunits_1.1.1 processx_3.5.3 [23] bit_4.0.4 curl_4.3.2 [25] compiler_4.1.3 cli_3.2.0 [27] Biobase_2.54.0 xml2_1.3.3 [29] desc_1.4.1 DelayedArray_0.20.0 [31] rtracklayer_1.54.0 callr_3.7.0 [33] rappdirs_0.3.3 commonmark_1.8.0 [35] stringr_1.4.0 digest_0.6.29 [37] Rsamtools_2.10.0 R.utils_2.11.0 [39] XVector_0.34.0 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 [41] pkgconfig_2.0.3 htmltools_0.5.2 [43] sessioninfo_1.2.2 MatrixGenerics_1.6.0 [45] dbplyr_2.1.1 fastmap_1.1.0 [47] BSgenome_1.62.0 rlang_1.0.2 [49] rstudioapi_0.13 RSQLite_2.2.12 [51] BiocIO_1.4.0 generics_0.1.2 [53] jsonlite_1.8.0 BiocParallel_1.28.3 [55] dplyr_1.0.8 R.oo_1.24.0 [57] VariantAnnotation_1.40.0 RCurl_1.98-1.6 [59] magrittr_2.0.3 GenomeInfoDbData_1.2.7 [61] Matrix_1.4-1 Rcpp_1.0.8.3 [63] S4Vectors_0.32.4 fansi_1.0.3 [65] lifecycle_1.0.1 R.methodsS3_1.8.1 [67] stringi_1.7.6 yaml_2.3.5 [69] SummarizedExperiment_1.24.0 zlibbioc_1.40.0 [71] brio_1.1.3 pkgbuild_1.3.1 [73] BiocFileCache_2.2.1 grid_4.1.3 [75] blob_1.2.2 parallel_4.1.3 [77] crayon_1.5.1 lattice_0.20-45 [79] Biostrings_2.62.0 GenomicFeatures_1.46.5 [81] hms_1.1.1 KEGGREST_1.34.0 [83] knitr_1.38 ps_1.6.0 [85] pillar_1.7.0 GenomicRanges_1.46.1 [87] rjson_0.2.21 BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 [89] biomaRt_2.50.3 stats4_4.1.3 [91] pkgload_1.2.4 XML_3.99-0.9 [93] glue_1.6.2 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 [95] remotes_2.4.2 data.table_1.14.2 [97] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20 png_0.1-7 [99] vctrs_0.4.0 testthat_3.1.3 [101] rcmdcheck_1.4.0 purrr_0.3.4 [103] assertthat_0.2.1 cachem_1.0.6 [105] xfun_0.30 restfulr_0.0.13 [107] roxygen2_7.1.2 gargle_1.2.0 [109] tibble_3.1.6 GenomicAlignments_1.30.0 [111] AnnotationDbi_1.56.2 memoise_2.0.1 [113] IRanges_2.28.0 ellipsis_0.3.2 ```

bschilder commented 2 years ago

Just tried an alternative function instead of read_vcf, and noticed some things:

path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
vcf <- VariantAnnotation::readVcf(file = path)
print(vcf)

The vcf object contains multiple fields in the geno. One of them is "AF" (alternative allele frequency), and another is "SI" (imputation accuracy).

Given that our original method is rather messy and prone to missing a lot of these pieces of info, I think we should move towards using VariantAnnotation::readVcf. I'll try to write a new vcf function that does this, and then converts to data.table format.

class: CollapsedVCF 
dim: 11734353 1 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: AF
info(header(vcf)):
      Number Type  Description     
   AF A      Float Allele Frequency
geno(vcf):
  List of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID
geno(header(vcf)):
      Number Type   Description                                                       
   ES A      Float  Effect size estimate relative to the alternative allele           
   SE A      Float  Standard error of effect size estimate                            
   LP A      Float  -log10 p-value for effect estimate                                
   AF A      Float  Alternate allele frequency in the association study               
   SS A      Float  Sample size used to estimate genetic effect                       
   EZ A      Float  Z-score provided if it was used to derive the EFFECT and SE fields
   SI A      Float  Accuracy score of summary data imputation                         
   NC A      Float  Number of cases used to estimate genetic effect                   
   ID 1      String Study variant identifier

Al-Murphy commented 2 years ago

Definitely makes sense to move over to a more robust read_vcf version. On the Allele Frequency (AF), how does AF differ from minor allele frequency? I assumed they were the same - this is why I picked up AF as INFO

bschilder commented 2 years ago

We discussed this in person, but just to document here: INFO can contain different fields that vary across files and within files (across rows). This means that using a simple parsing strategy is likely to make a lot of mistakes. For this reason, I rewrote read_vcf (and various other support functions) to take advantage of VariantAnnotation, which robustly parses all fields of any given VCF. read_vcf can write/return as VCF or data.table (via the vcf2df internal function).

The downside is that read_vcf is now much slower (<1 min vs. 9 min for 11M variants) I've been urging the VariantAnnotation maintainers to improve the efficiency of these functions so we can reduce this compute time. https://github.com/Bioconductor/VariantAnnotation/issues/57

Al-Murphy / MungeSumstats