Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Incorrect CHR assignment after liftover #104

Closed quattro closed 2 years ago

quattro commented 2 years ago

Hi all,

Thanks again for the fantastic tool. In performing liftover from GRCh37 -> GRCh38 I noticed a few strange things happening.

The following SNP is in my raw data

X   50858748    A   G   X:50858748:A:G  -0.0303521  0.9292  0.0670233   0.650651    35649

which gets mapped to

           SNP CHR       BP A1 A2             ID       BETA     INFO        SE        P     N
1: rs190850470   2 33174205  G  A X:50858748:A:G  0.0303521 0.929200 0.0670233 0.650651 35649

Inspecting this SNP in dbSNP displays the correct information for a variant with information chrX:51115913.

This is somewhat concerning as I'm unsure now how often this occurs at other SNPs in the genome after liftover calls. I hope you can help me in getting to the bottom of this.

Best, Nick

quattro commented 2 years ago

This is as I suspected. Looking a bit more at our results at chromosome 1, I now see several chromosomes being represented, which should not be the case.

zcat my_munged_gwas.chr1.tsv.gz | awk '{print $1}' | sort | uniq
1
21
9
CHR
NA

However a bit more digging seems to show that this is the result of the rtracklayer liftover itself, and not any special logic within MungeSumstats.

Lastly, I should note that NCBI remap seems to give the correct chrX mapping for this position, however there seems to be no R-based API.

Al-Murphy commented 2 years ago

Not sure if this is somehow related to your other issue but I get the following result when I run this:

test <- data.table::data.table("CHR"="X",
                               "BP"=50858748,
                               "A1"="G",    
                               "A2"="A",
                               "ID"="X:50858748:A:G",
                               "BETA"=-0.0303521,
                               "INFO"=0.9292,   
                               "SE"=0.0670233,  
                               "P"=0.650651,    
                               "N"=35649)
format_sumstats(test,ref_genome = "GRCH37",convert_ref_genome = "GRCH38",
                rmv_chr = c("MT"))

Output:
Reading header.
SNP CHR       BP A1 A2       BETA   INFO        SE        P     N
1: rs190850470   X 33174205  G  A -0.0303521 0.9292 0.0670233 0.650651 35649

So the chromosome is different to what you report. Anyway though you are correct that the difference is seen from rtracklayer::liftOver so isn't something we can do anything about. I would suggest creating an issue with them on this (You can add me to the issue if you like).

Cheers, Alan.

quattro commented 2 years ago

Hi @Al-Murphy, thanks for your help. I'll reach out there.

However, just to clarify, does your output reflect chromosome X but position 33174205 ?

I get a different output from your example,

> library(data.table)
data.table 1.14.2 using 16 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(MungeSumstats)
> test <- data.table::data.table("CHR"="X",
                               "BP"=50858748,
                               "A1"="G",        
                               "A2"="A",
                               "ID"="X:50858748:A:G",
                               "BETA"=-0.0303521,
                               "INFO"=0.9292,   
                               "SE"=0.0670233,  
                               "P"=0.650651,    
                               "N"=35649)
format_sumstats(test,ref_genome = "GRCH37",convert_ref_genome = "GRCH38",
                rmv_chr = c("MT"))

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /tmp/RtmpHlzKTx/filedd0ed348541.tsv.gz
Standardising column headers.
First line of summary statistics file: 
CHR BP  A1  A2  ID  BETA    INFO    SE  P   N   
Summary statistics report:
   - 1 rows
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 0 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 13 seconds.
1 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 0 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
Filtering SNPs based on INFO score.
All rows have INFO>=0.9
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
Checking for bi-allelic SNPs.
N already exists within sumstats_dt.
Performing data liftover from hg19 to hg38.
Converting summary statistics to Genomic Ranges.
Downloading chain file from UCSC Genome Browser.
trying URL 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz'
Content type 'unknown' length 227698 bytes (222 KB)
==================================================
/tmp/RtmpHlzKTx/hg19ToHg38.over.chain.gz
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Sorting coordinates.
Writing in tabular format ==> /tmp/RtmpHlzKTx/filedd0ed348541.tsv.gz
Summary statistics report:
   - 1 rows (100% of original 1 rows)
   - 1 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR       BP A1 A2       BETA   INFO        SE        P     N
1: rs190850470   2 33174205  G  A -0.0303521 0.9292 0.0670233 0.650651 35649

Here is my sessionInfo

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home1/nmancuso/projects/miniconda3/envs/munge/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MungeSumstats_1.5.4.1 data.table_1.14.2    

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.6.0                        Biobase_2.54.0                              httr_1.4.2                                  BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 jsonlite_1.7.2                              bit64_4.0.5                                
 [7] R.utils_2.11.0                              assertthat_0.2.1                            stats4_4.1.1                                BiocFileCache_2.2.0                         blob_1.2.2                                  BSgenome_1.62.0                            
[13] GenomeInfoDbData_1.2.7                      Rsamtools_2.10.0                            yaml_2.2.1                                  progress_1.2.2                              pillar_1.6.4                                RSQLite_2.2.8                              
[19] lattice_0.20-45                             glue_1.5.0                                  digest_0.6.28                               GenomicRanges_1.46.0                        XVector_0.34.0                              googleAuthR_1.4.0                          
[25] Matrix_1.3-4                                R.oo_1.24.0                                 XML_3.99-0.8                                pkgconfig_2.0.3                             biomaRt_2.50.0                              zlibbioc_1.40.0                            
[31] purrr_0.3.4                                 BiocParallel_1.28.0                         tibble_3.1.6                                KEGGREST_1.34.0                             generics_0.1.1                              IRanges_2.28.0                             
[37] ellipsis_0.3.2                              cachem_1.0.6                                SummarizedExperiment_1.24.0                 GenomicFeatures_1.46.1                      BiocGenerics_0.40.0                         cli_3.1.0                                  
[43] magrittr_2.0.1                              crayon_1.4.2                                memoise_2.0.0                               R.methodsS3_1.8.1                           fs_1.5.0                                    fansi_0.5.0                                
[49] xml2_1.3.2                                  tools_4.1.1                                 prettyunits_1.1.1                           hms_1.1.1                                   BiocIO_1.4.0                                gargle_1.2.0                               
[55] lifecycle_1.0.1                             matrixStats_0.61.0                          stringr_1.4.0                               S4Vectors_0.32.2                            DelayedArray_0.20.0                         AnnotationDbi_1.56.1                       
[61] Biostrings_2.62.0                           compiler_4.1.1                              GenomeInfoDb_1.30.0                         rlang_0.4.12                                grid_4.1.1                                  RCurl_1.98-1.5                             
[67] VariantAnnotation_1.40.0                    rjson_0.2.20                                rappdirs_0.3.3                              SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20    bitops_1.0-7                                restfulr_0.0.13                            
[73] DBI_1.1.1                                   curl_4.3.2                                  R6_2.5.1                                    GenomicAlignments_1.30.0                    dplyr_1.0.7                                 rtracklayer_1.54.0                         
[79] fastmap_1.1.0                               bit_4.0.4                                   utf8_1.2.2                                  filelock_1.0.2                              stringi_1.7.5                               parallel_4.1.1                             
[85] Rcpp_1.0.7                                  vctrs_0.3.8                                 png_0.1-7                                   dbplyr_2.1.1                                tidyselect_1.1.1
Al-Murphy commented 2 years ago

Hey yep it does but this may be a side effect of the change I made regarding case of the sex chromosomes, I'll push that fix to the release version of bioconductor which should go live roughly 2/3 days after. If you could rerun with that version (1.4.5) either by installing directly from the RELEASE_3_15 branch on github or easier, through bioconductor. Let me know if this gives you the same chromosome output as I get

quattro commented 2 years ago

Great will do! I forgot to mention, In case you were wondering, the strange MungeSumstats version (1.5.4.1) in my sessionInfo was the result of me hacking around to make sure the patch I wrote for the lowercase/uppercase X issue was actually installed.

Thanks again.

Al-Murphy commented 2 years ago

I was wondering haha cheers!