Closed quattro closed 2 years ago
This is as I suspected. Looking a bit more at our results at chromosome 1, I now see several chromosomes being represented, which should not be the case.
zcat my_munged_gwas.chr1.tsv.gz | awk '{print $1}' | sort | uniq
1
21
9
CHR
NA
However a bit more digging seems to show that this is the result of the rtracklayer liftover itself, and not any special logic within MungeSumstats.
Lastly, I should note that NCBI remap seems to give the correct chrX mapping for this position, however there seems to be no R-based API.
Not sure if this is somehow related to your other issue but I get the following result when I run this:
test <- data.table::data.table("CHR"="X",
"BP"=50858748,
"A1"="G",
"A2"="A",
"ID"="X:50858748:A:G",
"BETA"=-0.0303521,
"INFO"=0.9292,
"SE"=0.0670233,
"P"=0.650651,
"N"=35649)
format_sumstats(test,ref_genome = "GRCH37",convert_ref_genome = "GRCH38",
rmv_chr = c("MT"))
Output:
Reading header.
SNP CHR BP A1 A2 BETA INFO SE P N
1: rs190850470 X 33174205 G A -0.0303521 0.9292 0.0670233 0.650651 35649
So the chromosome is different to what you report. Anyway though you are correct that the difference is seen from rtracklayer::liftOver so isn't something we can do anything about. I would suggest creating an issue with them on this (You can add me to the issue if you like).
Cheers, Alan.
Hi @Al-Murphy, thanks for your help. I'll reach out there.
However, just to clarify, does your output reflect chromosome X
but position 33174205
?
I get a different output from your example,
> library(data.table)
data.table 1.14.2 using 16 threads (see ?getDTthreads). Latest news: r-datatable.com
> library(MungeSumstats)
> test <- data.table::data.table("CHR"="X",
"BP"=50858748,
"A1"="G",
"A2"="A",
"ID"="X:50858748:A:G",
"BETA"=-0.0303521,
"INFO"=0.9292,
"SE"=0.0670233,
"P"=0.650651,
"N"=35649)
format_sumstats(test,ref_genome = "GRCH37",convert_ref_genome = "GRCH38",
rmv_chr = c("MT"))
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /tmp/RtmpHlzKTx/filedd0ed348541.tsv.gz
Standardising column headers.
First line of summary statistics file:
CHR BP A1 A2 ID BETA INFO SE P N
Summary statistics report:
- 1 rows
- 1 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 0 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 13 seconds.
1 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 0 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
Filtering SNPs based on INFO score.
All rows have INFO>=0.9
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
Performing data liftover from hg19 to hg38.
Converting summary statistics to Genomic Ranges.
Downloading chain file from UCSC Genome Browser.
trying URL 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz'
Content type 'unknown' length 227698 bytes (222 KB)
==================================================
/tmp/RtmpHlzKTx/hg19ToHg38.over.chain.gz
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Sorting coordinates.
Writing in tabular format ==> /tmp/RtmpHlzKTx/filedd0ed348541.tsv.gz
Summary statistics report:
- 1 rows (100% of original 1 rows)
- 1 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 BETA INFO SE P N
1: rs190850470 2 33174205 G A -0.0303521 0.9292 0.0670233 0.650651 35649
Here is my sessionInfo
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /home1/nmancuso/projects/miniconda3/envs/munge/lib/libopenblasp-r0.3.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MungeSumstats_1.5.4.1 data.table_1.14.2
loaded via a namespace (and not attached):
[1] MatrixGenerics_1.6.0 Biobase_2.54.0 httr_1.4.2 BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 jsonlite_1.7.2 bit64_4.0.5
[7] R.utils_2.11.0 assertthat_0.2.1 stats4_4.1.1 BiocFileCache_2.2.0 blob_1.2.2 BSgenome_1.62.0
[13] GenomeInfoDbData_1.2.7 Rsamtools_2.10.0 yaml_2.2.1 progress_1.2.2 pillar_1.6.4 RSQLite_2.2.8
[19] lattice_0.20-45 glue_1.5.0 digest_0.6.28 GenomicRanges_1.46.0 XVector_0.34.0 googleAuthR_1.4.0
[25] Matrix_1.3-4 R.oo_1.24.0 XML_3.99-0.8 pkgconfig_2.0.3 biomaRt_2.50.0 zlibbioc_1.40.0
[31] purrr_0.3.4 BiocParallel_1.28.0 tibble_3.1.6 KEGGREST_1.34.0 generics_0.1.1 IRanges_2.28.0
[37] ellipsis_0.3.2 cachem_1.0.6 SummarizedExperiment_1.24.0 GenomicFeatures_1.46.1 BiocGenerics_0.40.0 cli_3.1.0
[43] magrittr_2.0.1 crayon_1.4.2 memoise_2.0.0 R.methodsS3_1.8.1 fs_1.5.0 fansi_0.5.0
[49] xml2_1.3.2 tools_4.1.1 prettyunits_1.1.1 hms_1.1.1 BiocIO_1.4.0 gargle_1.2.0
[55] lifecycle_1.0.1 matrixStats_0.61.0 stringr_1.4.0 S4Vectors_0.32.2 DelayedArray_0.20.0 AnnotationDbi_1.56.1
[61] Biostrings_2.62.0 compiler_4.1.1 GenomeInfoDb_1.30.0 rlang_0.4.12 grid_4.1.1 RCurl_1.98-1.5
[67] VariantAnnotation_1.40.0 rjson_0.2.20 rappdirs_0.3.3 SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20 bitops_1.0-7 restfulr_0.0.13
[73] DBI_1.1.1 curl_4.3.2 R6_2.5.1 GenomicAlignments_1.30.0 dplyr_1.0.7 rtracklayer_1.54.0
[79] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 filelock_1.0.2 stringi_1.7.5 parallel_4.1.1
[85] Rcpp_1.0.7 vctrs_0.3.8 png_0.1-7 dbplyr_2.1.1 tidyselect_1.1.1
Hey yep it does but this may be a side effect of the change I made regarding case of the sex chromosomes, I'll push that fix to the release version of bioconductor which should go live roughly 2/3 days after. If you could rerun with that version (1.4.5) either by installing directly from the RELEASE_3_15 branch on github or easier, through bioconductor. Let me know if this gives you the same chromosome output as I get
Great will do! I forgot to mention, In case you were wondering, the strange MungeSumstats version (1.5.4.1) in my sessionInfo
was the result of me hacking around to make sure the patch I wrote for the lowercase/uppercase X issue was actually installed.
Thanks again.
I was wondering haha cheers!
Hi all,
Thanks again for the fantastic tool. In performing liftover from GRCh37 -> GRCh38 I noticed a few strange things happening.
The following SNP is in my raw data
which gets mapped to
Inspecting this SNP in dbSNP displays the correct information for a variant with information
chrX:51115913
.This is somewhat concerning as I'm unsure now how often this occurs at other SNPs in the genome after liftover calls. I hope you can help me in getting to the bottom of this.
Best, Nick