Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Many non biallelic variants get removed #124

Closed AMCalejandro closed 2 years ago

AMCalejandro commented 2 years ago

1. Bug description

Did not have a change to read #111, but it may well be related to it

Console output

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /tmp/Rtmp4OEijF/file2efc29f07041.tsv.gz
Importing tabular file: ~/echolocatoR/echolocatoR_LID/QC_V2.txt
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
CHR BP  SNP MarkerName  Allele1 Allele2 Freq1   FreqSE  MinFreq MaxFreq Effect  StdErr  P-value Direction   HetISq  HetChiSq    HetDf   HetPVal TotalSampleSize MAF_variability 
Summary statistics report:
   - 6,559,907 rows
   - 6,530,651 unique variants
   - 4 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
29,257 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 6,559,906 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 311 seconds.
13,253 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 6,547,467 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 334 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 31 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 3,560,020 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
11,758 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
25 base-pair positions are duplicated in the sumstats file. These duplicates will be removed.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
2,992,104 SNPs are non-biallelic. These will be removed.
Warning: When method is an integer, must be >0.
3,229,709 SNPs (90.8%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /tmp/Rtmp4OEijF/file2efc29f07041.tsv.gz
Summary statistics report:                                                                                                           
   - 3,555,312 rows (54.2% of original 6,559,907 rows)
   - 3,555,312 unique variants
   - 2 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.

2. Reproducible example

Code

> reformatted <- 
+   MungeSumstats::format_sumstats(path= prueba,
+                                  ref_genome="GRCh37")

Data

> head(fread(prueba))
|--------------------------------------------------|
|==================================================|
   CHR     BP         SNP MarkerName Allele1 Allele2  Freq1 FreqSE MinFreq MaxFreq  Effect StdErr P-value Direction HetISq HetChiSq HetDf HetPVal TotalSampleSize
1:   1 731718  rs58276399   1:731718       t       c 0.8837 0.0028  0.8800  0.8950 -0.1775 0.1583  0.2621     ?---+      0    0.040     3  0.9979            1297
2:   1 731718 rs142557973   1:731718       t       c 0.8837 0.0028  0.8800  0.8950 -0.1775 0.1583  0.2621     ?---+      0    0.040     3  0.9979            1297
3:   1 734349 rs141242758   1:734349       t       c 0.8843 0.0025  0.8800  0.8950 -0.1577 0.1593  0.3223     ?---+      0    0.143     3  0.9862            1297
4:   1 753541   rs2073813   1:753541       a       g 0.1257 0.0013  0.1050  0.1283  0.0721 0.1177  0.5399     ++++-      0    0.220     4  0.9944            2687
5:   1 766007  rs61768174   1:766007       a       c 0.9005 0.0024  0.8967  0.9076 -0.2559 0.1642  0.1190     ?---+      0    0.065     3  0.9957            1297
6:   1 769223  rs60320384   1:769223       c       g 0.8749 0.0017  0.8728  0.8950 -0.0772 0.1178  0.5124     ----+      0    0.333     4  0.9876            2687
   MAF_variability
1:          0.0150
2:          0.0150
3:          0.0150
4:          0.0233
5:          0.0109
6:          0.0222

3. Session info

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 BSgenome_1.65.2                          rtracklayer_1.57.0                       Biostrings_2.65.3                       
 [5] XVector_0.37.1                           GenomicRanges_1.49.1                     GenomeInfoDb_1.33.5                      IRanges_2.31.2                          
 [9] S4Vectors_0.35.3                         BiocGenerics_0.43.1                      forcats_0.5.2                            stringr_1.4.1                           
[13] dplyr_1.0.10                             purrr_0.3.4                              readr_2.1.2                              tidyr_1.2.0                             
[17] tibble_3.1.8                             ggplot2_3.3.6                            tidyverse_1.3.2                          data.table_1.14.2                       
[21] echolocatoR_2.0.1                       

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3                              GGally_2.1.2                                R.methodsS3_1.8.2                          
  [4] echoLD_0.99.7                               bit64_4.0.5                                 knitr_1.40                                 
  [7] irlba_2.3.5                                 DelayedArray_0.23.1                         R.utils_2.12.0                             
 [10] rpart_4.1.16                                KEGGREST_1.37.3                             RCurl_1.98-1.8                             
 [13] AnnotationFilter_1.21.0                     generics_0.1.3                              GenomicFeatures_1.49.6                     
 [16] callr_3.7.2                                 usethis_2.1.6                               RSQLite_2.2.16                             
 [19] proxy_0.4-27                                chron_2.3-57                                bit_4.0.4                                  
 [22] tzdb_0.3.0                                  httpuv_1.6.5                                xml2_1.3.3                                 
 [25] lubridate_1.8.0                             SummarizedExperiment_1.27.2                 assertthat_0.2.1                           
 [28] viridis_0.6.2                               gargle_1.2.0                                xfun_0.32                                  
 [31] hms_1.1.2                                   promises_1.2.0.1                            evaluate_0.16                              
 [34] fansi_1.0.3                                 restfulr_0.0.15                             progress_1.2.2                             
 [37] dbplyr_2.2.1                                readxl_1.4.1                                Rgraphviz_2.41.1                           
 [40] igraph_1.3.4                                DBI_1.1.3                                   htmlwidgets_1.5.4                          
 [43] reshape_0.8.9                               downloadR_0.99.4                            googledrive_2.0.0                          
 [46] ellipsis_0.3.2                              backports_1.4.1                             biomaRt_2.53.2                             
 [49] deldir_1.0-6                                MatrixGenerics_1.9.1                        MungeSumstats_1.5.13                       
 [52] vctrs_0.4.1                                 Biobase_2.57.1                              remotes_2.4.2                              
 [55] ensembldb_2.21.4                            cachem_1.0.6                                withr_2.5.0                                
 [58] checkmate_2.1.0                             GenomicAlignments_1.33.1                    prettyunits_1.1.1                          
 [61] cluster_2.1.3                               ape_5.6-2                                   dir.expiry_1.5.0                           
 [64] lazyeval_0.2.2                              crayon_1.5.1                                basilisk.utils_1.9.2                       
 [67] crul_1.2.0                                  pkgconfig_2.0.3                             pkgload_1.3.0                              
 [70] nlme_3.1-159                                ProtGenerics_1.29.0                         XGR_1.1.8                                  
 [73] devtools_2.4.4                              nnet_7.3-17                                 rlang_1.0.5                                
 [76] miniUI_0.1.1.1                              lifecycle_1.0.1                             filelock_1.0.2                             
 [79] httpcode_0.3.0                              BiocFileCache_2.5.0                         modelr_0.1.9                               
 [82] echotabix_0.99.8                            dichromat_2.0-0.1                           rprojroot_2.0.3                            
 [85] cellranger_1.1.0                            coloc_5.1.0                                 matrixStats_0.62.0                         
 [88] graph_1.75.0                                Matrix_1.4-1                                osfr_0.2.8                                 
 [91] boot_1.3-28                                 reprex_2.0.2                                base64enc_0.1-3                            
 [94] processx_3.7.0                              googlesheets4_1.0.1                         png_0.1-7                                  
 [97] viridisLite_0.4.1                           rjson_0.2.21                                rootSolve_1.8.2.3                          
[100] bitops_1.0-7                                R.oo_1.25.0                                 ggnetwork_0.5.10                           
[103] blob_1.2.3                                  mixsqp_0.3-43                               echoplot_0.99.5                            
[106] dnet_1.1.7                                  jpeg_0.1-9                                  BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
[109] echodata_0.99.12                            scales_1.2.1                                memoise_2.0.1                              
[112] magrittr_2.0.3                              plyr_1.8.7                                  hexbin_1.28.2                              
[115] zlibbioc_1.43.0                             compiler_4.2.0                              echoconda_0.99.7                           
[118] BiocIO_1.7.1                                RColorBrewer_1.1-3                          catalogueR_1.0.0                           
[121] Rsamtools_2.13.4                            cli_3.3.0                                   urlchecker_1.0.1                           
[124] echoannot_0.99.7                            ps_1.7.1                                    patchwork_1.1.2                            
[127] htmlTable_2.4.1                             Formula_1.2-4                               MASS_7.3-58.1                              
[130] tidyselect_1.1.2                            stringi_1.7.8                               yaml_2.3.5                                 
[133] supraHex_1.35.0                             latticeExtra_0.6-30                         ggrepel_0.9.1                              
[136] grid_4.2.0                                  VariantAnnotation_1.43.3                    tools_4.2.0                                
[139] lmom_2.9                                    parallel_4.2.0                              rstudioapi_0.14                            
[142] foreign_0.8-82                              piggyback_0.1.3                             gridExtra_2.3                              
[145] gld_2.6.5                                   digest_0.6.29                               snpStats_1.47.1                            
[148] BiocManager_1.30.18                         shiny_1.7.2                                 proto_1.0.0                                
[151] Rcpp_1.0.9                                  broom_1.0.1                                 later_1.3.0                                
[154] OrganismDbi_1.39.1                          httr_1.4.4                                  AnnotationDbi_1.59.1                       
[157] RCircos_1.2.2                               ggbio_1.45.0                                biovizBase_1.45.0                          
[160] colorspace_2.0-3                            brio_1.1.3                                  rvest_1.0.3                                
[163] XML_3.99-0.10                               fs_1.5.2                                    reticulate_1.26                            
[166] splines_4.2.0                               RBGL_1.73.0                                 expm_0.999-6                               
[169] echofinemap_0.99.3                          basilisk_1.9.2                              Exact_3.1                                  
[172] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.22    sessioninfo_1.2.2                           xtable_1.8-4                               
[175] jsonlite_1.8.0                              testthat_3.1.4                              susieR_0.12.27                             
[178] R6_2.5.1                                    profvis_0.3.7                               Hmisc_4.7-1                                
[181] gsubfn_0.7                                  mime_0.12                                   pillar_1.8.1                               
[184] htmltools_0.5.3                             glue_1.6.2                                  fastmap_1.1.0                              
[187] DT_0.24                                     BiocParallel_1.31.12                        class_7.3-20                               
[190] codetools_0.2-18                            pkgbuild_1.3.1                              mvtnorm_1.1-3                              
[193] utf8_1.2.2                                  lattice_0.20-45                             sqldf_0.4-11                               
[196] curl_4.3.2                                  DescTools_0.99.46                           zip_2.2.0                                  
[199] openxlsx_4.2.5                              interp_1.1-3                                survival_3.3-1                             
[202] rmarkdown_2.16                              desc_1.4.1                                  googleAuthR_2.0.0                          
[205] munsell_0.5.0                               e1071_1.7-11                                GenomeInfoDbData_1.2.8                     
[208] haven_2.5.1                                 reshape2_1.4.4                              gtable_0.3.1      
Al-Murphy commented 2 years ago

This does appear the same issue as described in https://github.com/neurogenomics/MungeSumstats/issues/111, please do read it to see for advice and how to use MSS without removing non-biallelic SNPs.

I'll close this to avoid duplicate issues.

Cheers, Alan.