Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

`Could not recognize genome build of: - target_genome` #107

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

Just a note to myself to look into this message that comes up (but doesn't seem to cause an error).

Reprex

Data

Download the file plt.assoc from here: ftp://ftp.sanger.ac.uk/pub/project/humgen/summary_statistics/UKBBblood cell_traits/

Code

res <- MungeSumstats::format_sumstats(
    path = "~/Downloads/plt.assoc",
    log_folder_ind = TRUE,
    log_mungesumstats_msgs = TRUE,
    nThread = 30
    )

Output messages

Reading header.
Tabular format detected.
Importing tabular file: /home/bms20/RDS/project/neurogenomics-lab/live/Data/GWAS_sumstats/Original/Bloodtraits_Vuckovic_2020/plt.assoc
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Removing 1 empty columns.
Standardising column headers.
First line of summary statistics file: 
VARIANT ID  CHR CHR_num BP  GENPOS  REF ALT EFFECT  P   MLOG10P SE  ALT_FREQ    ALT_MINOR   MA_FREQ R2  INFO    
Summary statistics report:
   - 41,253,708 rows
   - 41,150,627 unique variants
   - 24,233,459 genome-wide significant variants (P<5e-8)
   - 24 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 157 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 10,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 165 seconds.
Inferred genome build: GRCH37
Checking SNP RSIDs.
297 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Loading SNPlocs data.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/snp_not_found_from_chr_bp.tsv.gz
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/snp_multi_colon.tsv.gz
2,691,049 SNP IDs appear to be made up of chr:bp, these will be replaced by their SNP ID from the reference genome
Loading SNPlocs data.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/snp_not_found_from_bp_chr.tsv.gz
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 39,541,871 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 373 seconds.
Found 1,196,345 Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 30,677 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/alleles_dont_match_ref_gen.tsv.gz
There are 1,084 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
0 p-values are <=5e-324 which LDSC/MAGMA may not be able to handle. These will be converted to 0.
Checking for duplicate SNPs from SNP ID.
91,371 RSIDs are duplicated in the sumstats file. These duplicates will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/dup_snp_id.tsv.gz
Checking for SNPs with duplicated base-pair positions.
Found  Indels. These won't be checked for duplicates based on base-pair position as there can be multiples.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Checking for duplicated rows.
Filtering SNPs based on INFO score.
24,836,295 SNPs are below the INFO threshold of 0.9 and will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/info_filter.tsv.gz
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y CHR uppercase.
313,022 SNPs are on chromosomes X, Y, MT and will be removed
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/chr_excl.tsv.gz
Checking for bi-allelic SNPs.
5,149,120 SNPs are non-biallelic. These will be removed.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/logs/snp_bi_allelic.tsv.gz
Warning: When method is an integer, must be >0.
422,016 SNPs (5.3%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Could not recognize genome build of:
 - target_genome
These will be inferred from the data.
Sorting coordinates.
Writing in tabular format ==> /shared/bms20/projects/MAGMA_Files_Public/data/GWAS_sumstats/Vuckovic2020.plt/Vuckovic2020.plt.tsv.gz
Summary statistics report:
   - 8,001,034 rows (19.4% of original 41,253,708 rows)
   - 8,001,034 unique variants
   - 57,215 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Successfully finished preparing sumstats file, preview:
Reading header.
           SNP CHR     BP A1 A2      VARIANT CHR_NUM     GENPOS         BETA     P     MLOG10P
1: rs570371753   1  58396  T  C  1:58396_T_C       1 0.00000000 -0.000630553 0.990 0.003279409
2: rs557418932   1  79192  T  G  1:79192_T_G       1 0.00000000  0.222863000 0.071 1.146258745
3: rs554639997   1  91588  G  A  1:91588_G_A       1 0.00000000  0.012562300 0.880 0.055973148
4: rs117217250   1 713979  C  G 1:713979_C_G       1 0.00390656  0.007363990 0.930 0.032350752
          SE      FRQ ALT_MINOR  MA_FREQ           R2     INFO
1: 0.0668779 0.000246      TRUE 0.000246 1.955696e-10 0.914143
2: 0.1236150 0.000077      TRUE 0.000077 7.648270e-06 0.921351
3: 0.0825709 0.000163      TRUE 0.000163 5.143812e-08 0.962859
4: 0.0817412 0.000175      TRUE 0.000175 1.897660e-08 0.935276

Session info

``` R Under development (unstable) (2022-02-25 r81808) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8 [10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] backports_1.4.1 AnnotationHub_3.5.0 BiocFileCache_2.5.0 [4] plyr_1.8.7 googleAuthR_2.0.0 lazyeval_0.2.2 [7] splines_4.2.0 orthogene_1.3.1 ewceData_1.5.0 [10] BiocParallel_1.31.10 GenomeInfoDb_1.33.3 ggplot2_3.3.6 [13] digest_0.6.29 yulab.utils_0.0.5 htmltools_0.5.3 [16] RNOmni_1.0.0 fansi_1.0.3 magrittr_2.0.3 [19] memoise_2.0.1 BSgenome_1.65.2 xlsx_0.6.5 [22] limma_3.53.4 Biostrings_2.65.1 matrixStats_0.62.0 [25] R.utils_2.12.0 prettyunits_1.1.1 colorspace_2.0-3 [28] blob_1.2.3 rappdirs_0.3.3 xfun_0.31 [31] dplyr_1.0.9 crayon_1.5.1 RCurl_1.98-1.7 [34] jsonlite_1.8.0 lme4_1.1-30 VariantAnnotation_1.43.2 [37] MAGMA.Celltyping_2.0.4 ape_5.6-2 glue_1.6.2 [40] gtable_0.3.0 gargle_1.2.0 zlibbioc_1.43.0 [43] XVector_0.37.0 HGNChelper_0.8.1 DelayedArray_0.23.0 [46] car_3.1-0 SingleCellExperiment_1.19.0 BiocGenerics_0.43.0 [49] abind_1.4-5 scales_1.2.0 DBI_1.1.3 [52] rstatix_0.7.0 Rcpp_1.0.9 viridisLite_0.4.0 [55] xtable_1.8-4 progress_1.2.2 tidytree_0.3.9 [58] gridGraphics_0.5-1 bit_4.0.4 stats4_4.2.0 [61] htmlwidgets_1.5.4 httr_1.4.3 ellipsis_0.3.2 [64] pkgconfig_2.0.3 XML_3.99-0.10 rJava_1.0-6 [67] R.methodsS3_1.8.2 dbplyr_2.2.1 utf8_1.2.2 [70] here_1.0.1 reshape2_1.4.4 ggplotify_0.1.0 [73] tidyselect_1.1.2 rlang_1.0.4 later_1.3.0 [76] AnnotationDbi_1.59.1 munsell_0.5.0 BiocVersion_3.16.0 [79] tools_4.2.0 cachem_1.0.6 cli_3.3.0 [82] generics_0.1.3 RSQLite_2.2.15 ExperimentHub_2.5.0 [85] MungeSumstats_1.5.5 broom_1.0.0 ggdendro_0.1.23 [88] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0 [91] yaml_2.3.5 ggtree_3.5.1 babelgene_22.3 [94] knitr_1.39 bit64_4.0.5 fs_1.5.2 [97] purrr_0.3.4 KEGGREST_1.37.3 gprofiler2_0.2.1 [100] nlme_3.1-158 mime_0.12 R.oo_1.25.0 [103] aplot_0.1.6 xml2_1.3.3 biomaRt_2.53.2 [106] compiler_4.2.0 rstudioapi_0.13 plotly_4.10.0 [109] filelock_1.0.2 curl_4.3.2 png_0.1-7 [112] interactiveDisplayBase_1.35.0 ggsignif_0.6.3 treeio_1.21.0 [115] tibble_3.1.8 EWCE_1.5.5 homologene_1.4.68.19.3.27 [118] stringi_1.7.8 GenomicFeatures_1.49.5 lattice_0.20-45 [121] Matrix_1.4-1 nloptr_2.0.3 vctrs_0.4.1 [124] pillar_1.8.0 lifecycle_1.0.1 BiocManager_1.30.18 [127] data.table_1.14.2 bitops_1.0-7 patchwork_1.1.1 [130] httpuv_1.6.5 rtracklayer_1.57.0 GenomicRanges_1.49.0 [133] R6_2.5.1 BiocIO_1.7.1 promises_1.2.0.1 [136] gridExtra_2.3 IRanges_2.31.0 codetools_0.2-18 [139] boot_1.3-28 MASS_7.3-58 assertthat_0.2.1 [142] SummarizedExperiment_1.27.1 xlsxjars_0.6.1 rprojroot_2.0.3 [145] rjson_0.2.21 GenomicAlignments_1.33.0 Rsamtools_2.13.3 [148] S4Vectors_0.35.1 GenomeInfoDbData_1.2.8 parallel_4.2.0 [151] hms_1.1.1 ggfun_0.0.6 grid_4.2.0 [154] minqa_1.2.4 tidyr_1.2.0 rmarkdown_2.14 [157] MatrixGenerics_1.9.1 carData_3.0-5 ggpubr_0.4.0 [160] Biobase_2.57.1 shiny_1.7.2 restfulr_0.0.15 ```
bschilder commented 2 years ago

Ok, this is indeed triggered by liftover() here: https://github.com/neurogenomics/MungeSumstats/blob/021f36322101e8b5dbc796fd03a2fad3795f4dae/R/format_sumstats.R#L896

I believe this is happening because the target genome build is being supplied as NULL. This corresponds to the convert_ref_genome argument.

I guess one solution would be to have an early check inn format_sumstats for the convert_ref_genome parameter and when it's NULL, set it to some genome ref version and send a message to the user.

Al-Murphy commented 2 years ago

Found the issue - I just needed to add a check to see if the user even wanted the genome build converted before running the checks to see if it needs to be inferred. This is updated in 1.5.6 (will push today)