Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Posthuma2021 GWAS #99

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

1. Bug description

I noticed a couple things when trying to munge the sumstats from the Posthuma 2021 AD GWAS provided here: https://ctg.cncr.nl/software/summary_statistics/

Expected behaviour

2. Reproducible example

Code

id <- "Posthuma-AD-2021"
gwas_paths <- MungeSumstats::format_sumstats(
  path = "https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz",
  save_path = here::here("data/GWAS_sumstats",paste0(id,".tsv.gz")),
  ref_genome = "GRCh37",
  nThread = 40,
  force_new = FALSE,
  #### Record logs  
  log_folder = here::here("data/GWAS_sumstats",id,"logs"),
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE) 

Console output

Reading remote sumstats

Looks like the remote file isn't being recognized. I confirmed it's a valid tsv by importing with fread manually.

Formatted summary statistics will be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Posthuma-AD-2021.tsv.gz
Log data to be saved to ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Posthuma-AD-2021/logs
Error in validate_parameters(path = path, ref_genome = ref_genome, convert_ref_genome = convert_ref_genome,  : 
  Path to GWAS sumstats is not valid, pass a file path or adataframe/data.table object to the path parameter

New column mappings

Also the header contains:

# header: "chr" "PosGRCh37" "testedAllele" "otherAllele" "z" "p" "N"  

We'll need to add to the mapping files accordingly:

We already have "tested.allele" and "tested_allele", but not a version with no separator. Given the variety of separators that can be used, I think it would be worth doing some standardization in addition to making cols uppercase. That way we don't have to include every iteration of each uncorrected col in the colmap file. Something like:

cnames <- gsub("[.]|[-]|[_]|[ ]","_",colnames(dat))
colmap <- MungeSumstats:::sumstatsColHeaders
colmap$Uncorrected <- gsub("[.]|[-]|[_]|[ ]","_",colmap$Uncorrected)

Missing SNP col

Lastly, the sumstats are missing a SNP column for some reason.... This causes the following error:

/MAGMA_Files_Public/data/GWAS_sumstats/Posthuma-AD-2021/logs/MungeSumstats_log_msg.txt
Reading header.
Tabular format detected.
Importing tabular file: /tmp/Rtmp061dc1/PGCALZ2sumstatsExcluding23andMe.txt.gz
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
chr PosGRCh37   testedAllele    otherAllele z   p   N   
Summary statistics report:
   - 12,688,339 rows
   - 3,570 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Summary statistics file does not have obvious CHR/BP columns. Checking to see if they are joined in another column.
Standardising column headers.
First line of summary statistics file: 
CHR POSGRCH37   TESTEDALLELE    A1  Z   P   N   
Error in check_vital_col(sumstats_dt = sumstats_return$sumstats_dt) : 
  Cannot find a SNP column in GWAS sumstats. 
Use code such as 'sed -i '' '1s/p_value/P/' IQ.Sniekers.2017.txt' to fix

Since we do some imputation of RSIDs at some point, and the genome build is known for this GWAS ("GRCh37") could we just impute the RSIDS of all SNPs and make a new column?

No need to worry about this until after Easter break, just putting here so I don't forget.

3. Session info

``` R Under development (unstable) (2022-02-25 r81808) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_1.0.8 ggplot2_3.3.5 MungeSumstats_1.3.17 MAGMA.Celltyping_2.0.2 loaded via a namespace (and not attached): [1] backports_1.4.1 AnnotationHub_3.3.8 BiocFileCache_2.3.4 [4] plyr_1.8.7 googleAuthR_2.0.0 lazyeval_0.2.2 [7] splines_4.2.0 orthogene_1.1.1 ewceData_1.3.0 [10] BiocParallel_1.29.20 GenomeInfoDb_1.31.7 digest_0.6.29 [13] htmltools_0.5.2 RNOmni_1.0.0 fansi_1.0.3 [16] magrittr_2.0.3 memoise_2.0.1 BSgenome_1.63.6 [19] limma_3.51.5 Biostrings_2.63.3 matrixStats_0.61.0 [22] R.utils_2.11.0 prettyunits_1.1.1 colorspace_2.0-3 [25] blob_1.2.3 rappdirs_0.3.3 xfun_0.30 [28] crayon_1.5.1 RCurl_1.98-1.6 jsonlite_1.8.0 [31] lme4_1.1-28 VariantAnnotation_1.41.3 glue_1.6.2 [34] gtable_0.3.0 gargle_1.2.0 zlibbioc_1.41.0 [37] XVector_0.35.0 HGNChelper_0.8.1 DelayedArray_0.21.2 [40] car_3.0-12 SingleCellExperiment_1.17.2 BiocGenerics_0.41.2 [43] abind_1.4-5 scales_1.1.1 DBI_1.1.2 [46] rstatix_0.7.0 Rcpp_1.0.8.3 viridisLite_0.4.0 [49] xtable_1.8-4 progress_1.2.2 bit_4.0.4 [52] stats4_4.2.0 htmlwidgets_1.5.4 httr_1.4.2 [55] ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.9 [58] R.methodsS3_1.8.1 dbplyr_2.1.1 utf8_1.2.2 [61] here_1.0.1 tidyselect_1.1.2 rlang_1.0.2 [64] reshape2_1.4.4 later_1.3.0 AnnotationDbi_1.57.1 [67] munsell_0.5.0 BiocVersion_3.15.0 tools_4.2.0 [70] cachem_1.0.6 cli_3.2.0 generics_0.1.2 [73] RSQLite_2.2.12 ExperimentHub_2.3.5 broom_0.7.12 [76] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0 [79] ggdendro_0.1.23 yaml_2.3.5 babelgene_21.4 [82] knitr_1.38 bit64_4.0.5 fs_1.5.2 [85] purrr_0.3.4 KEGGREST_1.35.0 gprofiler2_0.2.1 [88] nlme_3.1-155 mime_0.12 R.oo_1.24.0 [91] xml2_1.3.3 biomaRt_2.51.3 compiler_4.2.0 [94] rstudioapi_0.13 plotly_4.10.0 filelock_1.0.2 [97] curl_4.3.2 png_0.1-7 interactiveDisplayBase_1.33.0 [100] ggsignif_0.6.3 tibble_3.1.6 EWCE_1.3.3 [103] homologene_1.4.68.19.3.27 stringi_1.7.6 GenomicFeatures_1.47.13 [106] lattice_0.20-45 Matrix_1.4-0 nloptr_2.0.0 [109] vctrs_0.4.1 pillar_1.7.0 lifecycle_1.0.1 [112] BiocManager_1.30.16 data.table_1.14.2 bitops_1.0-7 [115] httpuv_1.6.5 patchwork_1.1.1 rtracklayer_1.55.4 [118] GenomicRanges_1.47.6 R6_2.5.1 BiocIO_1.5.0 [121] promises_1.2.0.1 gridExtra_2.3 IRanges_2.29.1 [124] boot_1.3-28 MASS_7.3-55 assertthat_0.2.1 [127] SummarizedExperiment_1.25.3 rprojroot_2.0.3 rjson_0.2.21 [130] withr_2.5.0 GenomicAlignments_1.31.2 Rsamtools_2.11.0 [133] S4Vectors_0.33.17 GenomeInfoDbData_1.2.8 parallel_4.2.0 [136] hms_1.1.1 grid_4.2.0 minqa_1.2.4 [139] tidyr_1.2.0 rmarkdown_2.13 MatrixGenerics_1.7.0 [142] carData_3.0-5 ggpubr_0.4.0 Biobase_2.55.2 [145] shiny_1.7.1 restfulr_0.0.13 ```
bschilder commented 2 years ago

Quick update:

Dummy SNPs: integers

Tried replacing the SNP col with a dummy col of row numbers.

dat <- data.table::fread("https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz")
dat[,SNP:=seq_len(nrow(dat))]
tmp <- file.path(tempdir(),basename(path))
data.table::fwrite(dat, tmp, sep="\t", nThread = 50)

but this produces a different error when running it through format_sumstats:

data/GWAS_sumstats/PGC-ALZ2/logs/MungeSumstats_log_msg.txt
Reading header.
Tabular format detected.
Importing tabular file: /tmp/Rtmp061dc1/PGCALZ2sumstatsExcluding23andMe.txt.gz
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
chr PosGRCh37   testedAllele    otherAllele z   p   N   SNP 
Summary statistics report:
   - 12,688,339 rows
   - 12,688,339 unique variants
   - 3,570 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Error in seq_len(nrow(x))[-irows] : 
  only 0's may be mixed with negative subscripts

Dummy SNPs: characters

However, when I make all dummy RSIDs characters by adding the prefix "snp", the pipeline runs to completion:

dat[,SNP:=paste0("snp",seq_len(nrow(dat)))]
data/GWAS_sumstats/PGC-ALZ2/logs/MungeSumstats_log_msg.txt
Reading header.
Tabular format detected.
Importing tabular file: /tmp/Rtmp061dc1/PGCALZ2sumstatsExcluding23andMe.txt.gz
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
chr PosGRCh37   testedAllele    otherAllele z   p   N   SNP 
Summary statistics report:
   - 12,688,339 rows
   - 12,688,339 unique variants
   - 3,570 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
12,688,339 SNPs are not on the reference genome. These will be corrected from the reference genome.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/snp_not_found_from_chr_bp.tsv.gz
Loading reference genome data.
There is no A1 or A2 allele information column found within the data. It must be inferred from  other column information.
One of A1/A2 are missing, allele flipping will be tested
Deriving A2 from reference genome
WARNING: Inferring the alternative allele (A2) from the reference genome. In some instances, there are more than one
alternative allele. Arbitrarily, only the first will be kept. See column `alt_alleles` in your returned sumstats file
for all alternative alleles.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/alleles_not_found_from_snp.tsv.gz
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 877,359 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/alleles_dont_match_ref_gen.tsv.gz
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
20,228 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/dup_snp_id.tsv.gz
Checking for SNPs with duplicated base-pair positions.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
343,205 SNPs are non-biallelic. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2/logs/snp_bi_allelic.tsv.gz
N already exists within sumstats_dt.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/PGC-ALZ2.tsv.gz
Summary statistics report:
   - 11,158,539 rows (87.9% of original 12,688,339 rows)
   - 11,151,777 unique variants
   - 2,432 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR    BP A1 A2 TESTEDALLELE           Z      P     N alt_alleles
1: rs533499096   1 14860  G  A            T -1.27587418 0.2020 74004           A
2: rs373918278   1 17765  G  A            A  0.09388564 0.9252 74004           A
3: rs763779009   1 17767  G  A            A -0.42614801 0.6700 74004           A
4: rs199900651   1 48186  T  G            G  0.91308089 0.3612 74004           G
Al-Murphy commented 2 years ago

Hello!!

We'll need to add to the mapping files accordingly: "PosGRCh37" --> "CHR" "testedAllele" --> "A1"

Do you mean BP not CHR? And the column matching already makes the columns upper case before checking. I'll add these though (and a few others like PosGRCh38 etc)!

On the main issue of not recognising remote sumstats - MungeSumstats wasn't really made to do this (just yet). I think the best solution would be to download the sumstats as an extra step before running any format checks if the file inputted is a https other than an IEU GWAS. It can then be treated like any local sumstats. I think this approach is necessary as the remote file could be a vcf or any other type so it will need to go through the same checks as a local sumstats anyway. Also it is hard to check if a remote file actually exists (see here) so just trying to download it before a lot of other code runs and giving a meaningful error through this seems best.

Let me know if I'm missing anything with this solution? Should get a fix pushed today otherwise.

Cheers

bschilder commented 2 years ago

Do you mean BP not CHR? And the column matching already makes the columns upper case before checking. I'll add these though (and a few others like PosGRCh38 etc)!

Yup! My bad, just corrected this in the original post. Thanks!

On the main issue of not recognising remote sumstats - MungeSumstats wasn't really made to do this (just yet). I think the best solution would be to download the sumstats as an extra step before running any format checks if the file inputted is a https other than an IEU GWAS. It can then be treated like any local sumstats. I think this approach is necessary as the remote file could be a vcf or any other type so it will need to go through the same checks as a local sumstats anyway. Also it is hard to check if a remote file actually exists (see here) so just trying to download it before a lot of other code runs and giving a meaningful error through this seems best.

I think I actually added some code to recognize remote URLs, but I only accounted for OpenGWS VCFs. Should be easy enough to adapt to other remote files though, since we could still infer whether it's vcf or tabular format from the URL, and both our read_vcf function and fread are able to read in remote files: https://github.com/neurogenomics/MungeSumstats/blob/c3c30269cbd7ba0f18b32306484988c5e215c732/R/read_vcf.R#L45

You're right though in that checking whether that remote file actually exists is another issue. But I think enabling URLs to be passed to path would be worth it bc it means users don't have to do the extra step of downloading the file beforehand.

Al-Murphy commented 2 years ago

Okay it should be able to handle remote sumstats now, I'm using fread to read in the header and then letting the current code figure out what to do with it from there. Note even though you set the n with fread, it still downloads the full file so this is a little bit of a repeated step but doesn't add much time at all if your internet access is okay. See below for the output will push once it passes checks locally:

> gwas_paths <- MungeSumstats::format_sumstats(
+   path = "https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz",
+   save_path = "~/Downloads/test.tsv.gz",
+   ref_genome = "GRCh37",
+   nThread = 40,
+   force_new = TRUE,
+   #### Record logs  
+   #log_folder = "~/Downloads/",
+   #log_folder_ind = TRUE,
+   #log_mungesumstats_msgs = TRUE
+   ) 
Formatted summary statistics will be saved to ==> /Users/alanmurphy/Downloads/test.tsv.gz
Reading header.
 [100%] Downloaded 314825956 bytes...
Tabular format detected.
Importing tabular file: https://ctg.cncr.nl/documents/p1651/PGCALZ2sumstatsExcluding23andMe.txt.gz
 [100%] Downloaded 314825956 bytes...
|--------------------------------------------------|
|==================================================|
Standardising column headers.
First line of summary statistics file: 
chr PosGRCh37   testedAllele    otherAllele z   p   N   
Summary statistics report:
   - 12,688,339 rows
   - 3,570 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
There is no SNP column found within the data. It must be inferred from other column information.
Ensuring all SNPs are on the reference genome.
Loading reference genome data.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 6,185 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 871,174 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
22,977 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Checking for SNPs with duplicated base-pair positions.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
372,165 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /Users/alanmurphy/Downloads/test.tsv.gz
Summary statistics report:                                                                                                           
   - 11,998,590 rows (94.6% of original 12,688,339 rows)
   - 11,991,381 unique variants
   - 2,620 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR    BP A1 A2           Z      P     N
1: rs533499096   1 14860  G  T -1.27587418 0.2020 74004
2: rs373918278   1 17765  G  A  0.09388564 0.9252 74004
3: rs763779009   1 17767  G  A -0.42614801 0.6700 74004
4: rs199900651   1 48186  T  G  0.91308089 0.3612 74004
Returning path to saved data.
bschilder commented 2 years ago

Awesome, thanks!

Did you have a chance to look at this as well?: https://github.com/neurogenomics/MungeSumstats/issues/99#issuecomment-1098516280

Al-Murphy commented 2 years ago

Ah nope forgot that!

Okay so that error was caused because you passed values to SNP that weren't RS IDs (don't start with rs), and didn't set snp_ids_are_rs_ids=TRUE. So it threw an error when looking on the reference genome for them. I've added a step that if all SNP values don't start with rs id and snp_ids_are_rs_ids=FALSE, then append rs at the start of each (i.e. "1" => "rs1" etc). This covers use if some sumstats use rs ids but don't start with rs. I've also added a warning for the user that if this happens they are told about the snp_ids_are_rs_ids input variable.

I'll push it in the next few minutes

bschilder commented 2 years ago

Ah, i gotcha. But the only reason i did that was because the sumstats were missing a SNP col entirely (see the "Missing SNP col" section in my original post):

Missing SNP col

Lastly, the sumstats are missing a SNP column for some reason.... This causes the following error:

I don't think setting the snp_ids_are_rs_ids arg would work in this case bc the error occurs at the step where required cols are checked.

Okay so that error was caused because you passed values to SNP that weren't RS IDs (don't start with rs), and didn't set snp_ids_are_rs_ids=TRUE. So it threw an error when looking on the reference genome for them. I've added a step that if all SNP values don't start with rs id and snp_ids_are_rs_ids=FALSE, then append rs at the start of each (i.e. "1" => "rs1" etc). This covers use if some sumstats use rs ids but don't start with rs. I've also added a warning for the user that if this happens they are told about the snp_ids_are_rs_ids input variable.

Sounds good to me, except I would actually use the prefix "snp" or something other than "rs" simply to make it impossible to accidentally assign a real RSID at the wrong position (which might occur in certain situations where certain checks are turned off.)

Al-Murphy commented 2 years ago

Yeah the SNP column wasn't being imputed as normal for you because the BP column also didn't exist so it couldn't use that. It runs fine now (if you want to test the latest version there just to make sure)?

Sounds good to me, except I would actually use the prefix "snp" or something other than "rs" simply to make it impossible to accidentally assign a real RSID at the wrong position (which might occur in certain situations where certain checks are turned off.)

Can't use snp or anything else as this is used to match against the reference genome so needs to be rs.

bschilder commented 2 years ago

Yeah the SNP column wasn't being imputed as normal for you because the BP column also didn't exist so it couldn't use that. It runs fine now (if you want to test the latest version there just to make sure)?

OK cool, I'll confirm once my current run finishes.

Can't use snp or anything else as this is used to match against the reference genome so needs to be rs.

What about SNPs where it's simply a chr:bp string? Those aren't simply removed are they? I thought MungeSumstats imputed those. Why not do the same with dummy RSIDS with some non-"rs" prefix? Using the "snp" prefix in the dummy column seemed to work well for me.

Al-Murphy commented 2 years ago

Ah actually good point, I remember thinking this through before now.

This change won't be able to pick up SNPs where it's simply a chr:bp string (or any other columns combine which is quite common). I'm going to revert it now.

Adding a prefix like snp won't help since mungesumstats uses the SNP column to impute/check on reference genome so it will throw an error. Only way I can see around it is with the snp_ids_are_rs_ids variable so the SNP column can be renamed in these instances.

Let me know if you see a way around this I'm missing?