Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
73 stars 16 forks source link

Error in split out chr:bp (check_no_rs_snp) #120

Closed abigailterkuile closed 2 years ago

abigailterkuile commented 2 years ago

Error in split out chr:bp (check_no_rs_snp)

Hi there,

I have GWAS summary statistics with a mixture of rsIDs and CHR:BP:A2:A1 - e.g: CHR BP SNP A1 A2
1 701203 chr1:701203:G:T T G
1 710225 rs185127847 A T
1 722408 chr1:722408:G:C C G

I get an error message when running format_sumstats:

Code

reformatted <- 
  MungeSumstats::format_sumstats(path=GAD_nihr_Pth,
                                 ref_genome="GRCh38",
                                 save_path=file.path(file_path_save, basename(GAD_formatted)))

Console output

Standardising column headers.
First line of summary statistics file:
CHR BP  SNP A1  A2  A1FREQ  INFO    N   BETA    SE  P
Summary statistics report:
   - 14,245,969 rows
   - 14,245,969 unique variants
   - 1 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
0 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Error in if (nchar(splits[1]) < nchar(splits[2])) { :
  missing value where TRUE/FALSE needed
Calls: <Anonymous> -> check_no_rs_snp
Execution halted

Session info

``` > sessionInfo() R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /software/spackages_prod/apps/linux-ubuntu20.04-zen2/gcc-9.4.0/r-4.1.1-w2swqhtlwhxzg45ipiuh2fnj4zefy7yh/rlib/R/lib/libRblas.so LAPACK: /software/spackages_prod/apps/linux-ubuntu20.04-zen2/gcc-9.4.0/r-4.1.1-w2swqhtlwhxzg45ipiuh2fnj4zefy7yh/rlib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] MungeSumstats_1.2.4 loaded via a namespace (and not attached): [1] MatrixGenerics_1.6.0 Biobase_2.54.0 [3] httr_1.4.4 bit64_4.0.5 [5] jsonlite_1.8.0 R.utils_2.12.0 [7] assertthat_0.2.1 stats4_4.1.1 [9] BiocFileCache_2.2.1 blob_1.2.3 [11] BSgenome_1.62.0 GenomeInfoDbData_1.2.7 [13] Rsamtools_2.10.0 yaml_2.3.5 [15] progress_1.2.2 pillar_1.8.1 [17] RSQLite_2.2.16 lattice_0.20-45 [19] glue_1.6.2 digest_0.6.29 [21] GenomicRanges_1.46.1 XVector_0.34.0 [23] googleAuthR_2.0.0 Matrix_1.4-1 [25] R.oo_1.25.0 XML_3.99-0.10 [27] pkgconfig_2.0.3 biomaRt_2.50.3 [29] zlibbioc_1.40.0 purrr_0.3.4 [31] BiocParallel_1.28.3 tibble_3.1.8 [33] KEGGREST_1.34.0 generics_0.1.3 [35] IRanges_2.28.0 ellipsis_0.3.2 [37] cachem_1.0.6 SummarizedExperiment_1.24.0 [39] GenomicFeatures_1.46.5 BiocGenerics_0.40.0 [41] cli_3.3.0 magrittr_2.0.3 [43] crayon_1.5.1 memoise_2.0.1 [45] R.methodsS3_1.8.2 fs_1.5.2 [47] fansi_1.0.3 xml2_1.3.3 [49] tools_4.1.1 data.table_1.14.2 [51] prettyunits_1.1.1 hms_1.1.2 [53] BiocIO_1.4.0 gargle_1.2.0 [55] lifecycle_1.0.1 matrixStats_0.62.0 [57] stringr_1.4.1 S4Vectors_0.32.4 [59] DelayedArray_0.20.0 AnnotationDbi_1.56.2 [61] Biostrings_2.62.0 compiler_4.1.1 [63] GenomeInfoDb_1.30.1 rlang_1.0.4 [65] grid_4.1.1 RCurl_1.98-1.8 [67] VariantAnnotation_1.40.0 rjson_0.2.21 [69] rappdirs_0.3.3 bitops_1.0-7 [71] restfulr_0.0.15 DBI_1.1.3 [73] curl_4.3.2 R6_2.5.1 [75] GenomicAlignments_1.30.0 dplyr_1.0.9 [77] rtracklayer_1.54.0 fastmap_1.1.0 [79] bit_4.0.4 utf8_1.2.2 [81] filelock_1.0.2 stringi_1.7.8 [83] parallel_4.1.1 Rcpp_1.0.9 [85] vctrs_0.4.1 png_0.1-7 [87] dbplyr_2.2.1 tidyselect_1.1.2 ```
Al-Murphy commented 2 years ago

Hi,

Thanks for posting the issue. MSS should be able to handle a mixture of RS IDs and CHR:BP:A2:A1.

Can you upload a small summary statistics file that replicates this issue (or a link to the full summary statistics)? I can't replicate the issue with the example SNPs you list:

CHR BP SNP A1 A2 P beta
1 701203 chr1:701203:G:T T G 0.8 1.2
1 710225 rs185127847 A T 0.65 .001
1 722408 chr1:722408:G:C C G 0.45 .01233

It runs fine:

> sumstats <- fread("~/Downloads/test.txt")
> reformatted <- 
+   MungeSumstats::format_sumstats(sumstats,
+                                  ref_genome="GRCh38")

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb295c2601.tsv.gz
Standardising column headers.
First line of summary statistics file: 
CHR BP  SNP A1  A2  P   beta    
Summary statistics report:
   - 3 rows
   - 3 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
2 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 2 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
SE is not present but can be imputed with BETA & P. Set impute_se=TRUE and rerun to do this.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb295c2601.tsv.gz
Summary statistics report:
   - 2 rows (66.7% of original 3 rows)
   - 2 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2    P     BETA
1: rs185127847   1 710225  T  A 0.65 -0.00100
2: rs760310201   1 722408  G  C 0.45 -0.01233
Returning path to saved data.
abigailterkuile commented 2 years ago

Thanks for the quick response. Here's a small version of the summary statistics file: GAD_1000_snps.txt

Al-Murphy commented 2 years ago

Thanks for that however it still runs without issue for me, could you. try installing the latest version of MSS from github to see if that helps?

Here is my run output:

sumstats <- fread("~/Downloads/GAD_1000_snps.txt")
> reformatted <- 
+   MungeSumstats::format_sumstats(sumstats,
+                                  ref_genome="GRCh38")

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb2aa2becc.tsv.gz
Standardising column headers.
First line of summary statistics file: 
CHR BP  SNP A1  A2  A1FREQ  INFO    N   BETA    SE  P   
Summary statistics report:
   - 999 rows
   - 999 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
273 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 986 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
49 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 939 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 12 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 940 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
1 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
Filtering SNPs based on INFO score.
702 SNPs are below the INFO threshold of 0.9 and will be removed.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
11 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
195 SNPs (86.3%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6XstOa/file2cb2aa2becc.tsv.gz
Summary statistics report:
   - 226 rows (22.6% of original 999 rows)
   - 226 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2       FRQ     INFO     N       BETA        SE         P
1: rs138388092   1 802843  T  C 0.9828785 0.903520 20452 -0.1761780 0.1699330 0.2998513
2:  rs12562034   1 833068  G  A 0.8941450 0.999894 20452  0.0160558 0.0694009 0.8170437
3: rs151160018   1 841176  C  T 0.9918606 0.903191 20452 -0.1062280 0.2545640 0.6764629
4: rs112618790   1 841852  C  T 0.9103400 0.939107 20452  0.0328076 0.0775737 0.6723525
Returning path to saved data.
abigailterkuile commented 2 years ago

Updating the package to the latest version from GitHub resolved the issue. Thanks very much, Alan!