RajLabMSSM / echolocatoR

Automated statistical and functional fine-mapping pipeline with extensive API access to datasets.
https://rajlabmssm.github.io/echolocatoR
MIT License
30 stars 11 forks source link

Error in dplyr::slice_max(., order_by = c(symbol, width), n = max_transcripts,...) - slice_max() does not efficiently get the transcript with max width #96

Closed AMCalejandro closed 1 year ago

AMCalejandro commented 2 years ago

1. Bug description

sdplyr::lice_max() messes up the transcripts input even though you are grouping to work arounf previous issues you had. Fortunately this is essay to solve. I am opening a PR

Console output

+-------- Locus Plot: ACP6--------+
[1] "+ Filling NAs in CS cols with 0"
[1] "+ Filling NAs in PP cols with 0"
[1] "+ LD:: LD_matrix detected. Coloring SNPs by LD with lead SNP."
[1] "++ PLOT:: motor_progression track"
[1] "++ PLOT:: Merged fine-mapping track"
[1] "Melting PP and CS from 5 fine-mapping methods"
[1] "++ PLOT:: Adding Gene model track."
Error in dplyr::slice_max(., order_by = c(symbol, width), n = max_transcripts,  :
  Problem while computing indices.
ℹ The error occurred in group 1: symbol = "ACP6".
Caused by error:
! `order_by` must have size 6, not size 12.

2. Reproducible example

This is the input passed to dplyr::slice_max()

data.frame(txdb_transcripts[S4Vectors::subjectHits(hits),]) %>%
    ## !!IMPORTANT!! If unique() isn't applied here,
    ## dplyr will pick duplicates of the same transcript
    unique() %>%
    subset((!is.na(symbol)) & (!is.na(tx_name))) %>%
    dplyr::group_by(symbol) 

Data

# A tibble: 60 × 12
# Groups:   symbol [29]
   seqnames     start       end  width strand tx_id  tx_biotype tx_cds_seq_start
   <fct>        <int>     <int>  <int> <fct>  <chr>  <chr>                 <int>
 1 1        146492879 146596107 103229 -      ENST0… unprocess…               NA
 2 1        146986158 146989699   3542 -      ENST0… lincRNA                  NA
 3 1        147013182 147098017  84836 +      ENST0… protein_c…        147083636
 4 1        147101453 147131116  29664 -      ENST0… protein_c…        147102845
 5 1        146490895 146514599  23705 -      ENST0… unprocess…               NA
 6 1        146626685 146644123  17439 -      ENST0… protein_c…        146631144
 7 1        146630798 146634089   3292 -      ENST0… processed…               NA
 8 1        146631069 146644046  12978 -      ENST0… protein_c…        146631144
 9 1        147228332 147245484  17153 -      ENST0… protein_c…        147230270
10 1        147229306 147232685   3380 -      ENST0… protein_c…        147230270
# … with 50 more rows, and 4 more variables: tx_cds_seq_end <int>,
#   gene_id <chr>, tx_name <chr>, symbol <chr>

3. Session info

``` > sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS/LAPACK: /home/acarrasco/.conda/envs/echoR/lib/libopenblasp-r0.3.12.so locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.14.1 [3] AnnotationFilter_1.14.0 GenomicFeatures_1.42.3 [5] AnnotationDbi_1.52.0 Biobase_2.50.0 [7] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 [9] IRanges_2.24.1 S4Vectors_0.28.1 [11] BiocGenerics_0.36.1 Matrix_1.3-2 [13] forcats_0.5.1 stringr_1.4.0 [15] dplyr_1.0.8 purrr_0.3.4 [17] readr_2.1.2 tidyr_1.2.0 [19] tibble_3.1.6 tidyverse_1.3.1 [21] echolocatoR_0.2.2 data.table_1.14.2 [23] ggplot2_3.3.5 loaded via a namespace (and not attached): [1] colorspace_2.0-3 ellipsis_0.3.2 [3] biovizBase_1.38.0 htmlTable_2.4.0 [5] XVector_0.30.0 base64enc_0.1-3 [7] fs_1.5.2 dichromat_2.0-0 [9] rstudioapi_0.13 farver_2.1.0 [11] ggrepel_0.9.1 bit64_4.0.5 [13] fansi_1.0.3 lubridate_1.8.0 [15] xml2_1.3.3 splines_4.0.3 [17] R.methodsS3_1.8.1 cachem_1.0.6 [19] knitr_1.37 Formula_1.2-4 [21] jsonlite_1.8.0 Rsamtools_2.6.0 [23] broom_0.7.12 cluster_2.1.1 [25] dbplyr_2.1.1 png_0.1-7 [27] R.oo_1.24.0 compiler_4.0.3 [29] httr_1.4.2 backports_1.4.1 [31] assertthat_0.2.1 fastmap_1.1.0 [33] lazyeval_0.2.2 cli_3.2.0 [35] prettyunits_1.1.1 htmltools_0.5.2 [37] tools_4.0.3 gtable_0.3.0 [39] glue_1.6.2 GenomeInfoDbData_1.2.4 [41] rappdirs_0.3.3 Rcpp_1.0.8.3 [43] cellranger_1.1.0 vctrs_0.3.8 [45] Biostrings_2.58.0 rtracklayer_1.50.0 [47] xfun_0.30 rvest_1.0.2 [49] lifecycle_1.0.1 XML_3.99-0.6 [51] zlibbioc_1.36.0 scales_1.1.1 [53] BSgenome_1.58.0 VariantAnnotation_1.36.0 [55] ProtGenerics_1.22.0 hms_1.1.1 [57] MatrixGenerics_1.2.1 SummarizedExperiment_1.20.0 [59] RColorBrewer_1.1-2 curl_4.3 [61] memoise_2.0.1 gridExtra_2.3 [63] biomaRt_2.46.3 rpart_4.1-15 [65] latticeExtra_0.6-29 stringi_1.7.6 [67] RSQLite_2.2.11 checkmate_2.0.0 [69] BiocParallel_1.24.1 rlang_1.0.2 [71] pkgconfig_2.0.3 bitops_1.0-7 [73] matrixStats_0.61.0 lattice_0.20-41 [75] GenomicAlignments_1.26.0 htmlwidgets_1.5.4 [77] labeling_0.4.2 bit_4.0.4 [79] tidyselect_1.1.2 magrittr_2.0.2 [81] R6_2.5.1 generics_0.1.2 [83] Hmisc_4.6-0 DelayedArray_0.16.3 [85] DBI_1.1.2 pillar_1.7.0 [87] haven_2.4.3 foreign_0.8-81 [89] withr_2.5.0 survival_3.2-11 [91] RCurl_1.98-1.6 nnet_7.3-15 [93] modelr_0.1.8 crayon_1.5.0 [95] utf8_1.2.2 BiocFileCache_1.14.0 [97] tzdb_0.2.0 progress_1.2.2 [99] jpeg_0.1-9 grid_4.0.3 [101] readxl_1.3.1 blob_1.2.2 [103] reprex_2.0.1 digest_0.6.29 [105] R.utils_2.11.0 openssl_2.0.0 [107] munsell_0.5.0 askpass_1.1 > ```
AMCalejandro commented 2 years ago

Applying this change the transcripts get efficiently added to TRKS$Genes

>  if(gene_track){
    printer("++ PLOT:: Adding Gene model track.",v=verbose)
    try({
      TRKS[["Genes"]] <- PLOT.transcript_model_track(finemap_dat = finemap_dat,
                                                     show.legend = show.legend_genes,
                                                     xtext = xtext,
                                                     max_transcripts = max_transcripts,
                                                     expand_x_mult=NULL,
                                                     verbose=T)
    })
  }
[1] "++ PLOT:: Adding Gene model track."
max_transcripts=1.
8 transcripts from 8 genes returned.
Registered S3 method overwritten by 'GGally':
  method from
  +.gg   ggplot2
Fetching data...OK
Parsing exons...OK
Defining introns...OK
Defining UTRs...OK
Defining CDS...OK
aggregating...
Done
Constructing graphics...
> TRKS
$`GWAS full window`

$motor_progression

$`Fine-mapping`

$Genes
bschilder commented 1 year ago

Now fixed in both master and echoverse branches thanks to PR by @AMCalejandro ! In echoverse, also added the ability to select >1 transcript/gene with max_transcripts arg.