Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

format_sumstats error "Can't assign to the same column twice in the same query (duplicates detected)" #110

Closed kisudsoe closed 2 years ago

kisudsoe commented 2 years ago

Bug description

Hi developer,

I ran format_sumstats on a regular plink2 result, but it threw the error "Can't assign to the same column twice in the same query (duplicates detected)". I assume a data.table call triggered the error.

I would appreciate any advice. Thank you!

Console output

reformatted <- MungeSumstats::format_sumstats(path = sumstat_path, ref_genome = "GRCh37", save_path=file.path('sumstat_munge',out_path))
Formatted summary statistics will be saved to ==>  sumstat_munge/merge-fid2714_Age_when_periods_started.glm.linear.gztxt.gz
Reading header.
Tabular format detected.
Importing tabular file: sumstat/merge-fid2714_Age_when_periods_started.glm.linear.gz
|--------------------------------------------------|
|==================================================|
Checking for empty columns.
Removing 4 empty columns.
Error in `[.data.table`(sumstats_dt, , `:=`((names(empty_cols)), NULL)) : 
  Can't assign to the same column twice in the same query (duplicates detected).
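
For readers hitting the same message: data.table raises this error when the column names on the left-hand side of `:=` contain a duplicate. The sketch below reproduces it on a toy table. The column names are made up, and whether duplicated names among the empty columns are what happened with this plink2 file is an assumption, not something confirmed in this thread.

```r
library(data.table)

# Toy table with two all-NA ("empty") columns; the column names are hypothetical.
dt <- data.table(SNP = c("rs1", "rs2"), X1 = NA, X2 = NA)

# Passing the same name twice on the left-hand side of `:=` fails.
dup_names <- c("X1", "X1")
err <- tryCatch(
  dt[, (dup_names) := NULL],
  error = function(e) conditionMessage(e)
)
print(err)

# De-duplicating the names first lets the removal go through.
dt[, (unique(c("X1", "X2", "X1"))) := NULL]
print(names(dt))
```

If the names of the empty columns in a real file repeat, wrapping them in `unique()` as above is one possible workaround, though the proper fix belongs in the package itself.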

Here is my plink2 result header.

> head(stat)
  CHROM   POS          ID REF ALT A1 TEST OBS_CT       BETA       SE    T_STAT
1    10 60684 rs569167217   A   C  C  ADD  47865 -0.0281776 0.113878 -0.247436
2    10 61331 rs548639866   A   G  G  ADD  47824 -0.0301179 0.113883 -0.264464
3    10 61419 rs553163044   G   A  A  ADD  37099  0.3499160 0.140269  2.494600
4    10 63213 rs542543788   G   C  C  ADD  47865 -0.0281776 0.113878 -0.247436
5    10 64869 rs556434813   C   A  A  ADD  48272 -0.0743172 0.163170 -0.455460
6    10 64972 rs868964520   G   A  A  ADD  48248  0.0526655 0.305585  0.172343
          P
1 0.8045720
2 0.7914230
3 0.0126143
4 0.8045720
5 0.6487810
6 0.8631690

Session info


```
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.8.1
 [2] Biobase_2.56.0
 [3] httr_1.4.3
 [4] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
 [5] bit64_4.0.5
 [6] jsonlite_1.8.0
 [7] R.utils_2.12.0
 [8] assertthat_0.2.1
 [9] stats4_4.2.1
[10] BiocFileCache_2.4.0
[11] blob_1.2.3
[12] BSgenome_1.64.0
[13] GenomeInfoDbData_1.2.8
[14] Rsamtools_2.12.0
[15] yaml_2.3.5
[16] progress_1.2.2
[17] pillar_1.8.0
[18] RSQLite_2.2.15
[19] lattice_0.20-45
[20] glue_1.6.2
[21] digest_0.6.29
[22] GenomicRanges_1.48.0
[23] XVector_0.36.0
[24] googleAuthR_2.0.0
[25] Matrix_1.4-1
[26] R.oo_1.25.0
[27] XML_3.99-0.10
[28] pkgconfig_2.0.3
[29] biomaRt_2.52.0
[30] zlibbioc_1.42.0
[31] purrr_0.3.4
[32] BiocParallel_1.30.3
[33] tibble_3.1.8
[34] KEGGREST_1.36.3
[35] generics_0.1.3
[36] IRanges_2.30.0
[37] ellipsis_0.3.2
[38] cachem_1.0.6
[39] SummarizedExperiment_1.26.1
[40] GenomicFeatures_1.48.3
[41] BiocGenerics_0.42.0
[42] cli_3.3.0
[43] magrittr_2.0.3
[44] crayon_1.5.1
[45] memoise_2.0.1
[46] R.methodsS3_1.8.2
[47] fs_1.5.2
[48] fansi_1.0.3
[49] xml2_1.3.3
[50] tools_4.2.1
[51] data.table_1.14.2
[52] prettyunits_1.1.1
[53] hms_1.1.1
[54] BiocIO_1.6.0
[55] gargle_1.2.0
[56] lifecycle_1.0.1
[57] matrixStats_0.62.0
[58] stringr_1.4.0
[59] S4Vectors_0.34.0
[60] DelayedArray_0.22.0
[61] AnnotationDbi_1.58.0
[62] Biostrings_2.64.0
[63] compiler_4.2.1
[64] GenomeInfoDb_1.32.2
[65] rlang_1.0.4
[66] grid_4.2.1
[67] RCurl_1.98-1.7
[68] VariantAnnotation_1.42.1
[69] rjson_0.2.21
[70] rappdirs_0.3.3
[71] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20
[72] bitops_1.0-7
[73] restfulr_0.0.15
[74] codetools_0.2-18
[75] DBI_1.1.3
[76] curl_4.3.2
[77] R6_2.5.1
[78] GenomicAlignments_1.32.1
[79] dplyr_1.0.9
[80] rtracklayer_1.56.1
[81] fastmap_1.1.0
[82] bit_4.0.4
[83] utf8_1.2.2
[84] filelock_1.0.2
[85] MungeSumstats_1.4.5
[86] stringi_1.7.8
[87] parallel_4.2.1
[88] Rcpp_1.0.9
[89] vctrs_0.4.1
[90] png_0.1-7
[91] dbplyr_2.2.1
[92] tidyselect_1.1.2
```

Al-Murphy commented 2 years ago

Hey, I can't replicate the error, which I believe may be down to the input file format I'm using: I just copied the sample data you posted above into a txt file. Can you attach a small subset of the file that gives the same error?

kisudsoe commented 2 years ago

Hi Murphy, please find test.txt.gz attached. I checked that this file gives the same error. Thank you!

Al-Murphy commented 2 years ago

Hey,

I don't get any error when using version 1.5.6 (the current master branch on GitHub). Could you try installing this latest version to see if it resolves the issue for you too? This version will migrate to the Bioconductor master branch soon anyway.
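
A minimal sketch of checking whether an installed version is older than the one suggested here. The version strings come from this thread; the commented-out install call assumes the `remotes` package and the repository path shown at the top of this page.

```r
# Versions quoted in this thread: 1.4.5 from the session info, 1.5.6 from GitHub master.
# In a live session, the installed version would come from
# as.character(utils::packageVersion("MungeSumstats")).
installed <- "1.4.5"
suggested <- "1.5.6"

# compareVersion() returns -1 when the first version is older than the second.
needs_update <- utils::compareVersion(installed, suggested) < 0
print(needs_update)

# Assumption: installing the development version with the remotes package.
# if (needs_update) remotes::install_github("Al-Murphy/MungeSumstats")
```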

Cheers, Alan.

Al-Murphy commented 2 years ago

Closing for now; please re-open if the new version doesn't solve your issue.

Thanks, Alan.

PhoebeGuo97 commented 2 years ago

Hello Alan,

I still get the same error, even with version 1.5.12 of MungeSumstats.

Al-Murphy commented 2 years ago

Hey, can you open a new issue for this with sample data and the message log, so it's easier to track?

Thanks, Alan.