Closed lotard closed 3 months ago
I refined it further - if I use the perfect matrix on a variant which inverts the binding site (pinv
here) I get pctAlt 4/6. The same if inversion has one mismatch (ipinv
). This is running motifbreakR with no filters (filterp=F, threshold=0,show.neutral=T)
So even though a negative strand scan is performed (scoreSeqWindows
function uses both the motif and reverse complement of the motif) it somehow doesn't get reported?
In any case, what I'd expect is:
Could you please share the objects that you are using, as an .rda file, and I'll load them up and see where the problem is arising in the code.
Thanks for looking into it. Here's a reprex R code and .rda of it for the second case [inversion] (will say it failed to load MotifDb due to timelimit, but loads objects up fine). reprex_pinv.zip
Here's reprex and .rda for the first case (blocks), except I used 100% perfect matrix (instead of 97%), so it's same in both. Doesn't really affect results anyway. reprex_blocks.zip
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3 rtracklayer_1.58.0
[4] motifbreakR_2.12.3 MotifDb_1.40.0 Biostrings_2.66.0
[7] XVector_0.38.0 GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
[10] IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0
[13] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[16] dplyr_1.1.0 purrr_1.0.1 readr_2.1.4
[19] tidyr_1.3.0 tibble_3.2.0 ggplot2_3.4.1
[22] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] backports_1.4.1 Hmisc_5.0-1 BiocFileCache_2.6.1 plyr_1.8.8
[5] lazyeval_0.2.2 BiocParallel_1.32.5 TFBSTools_1.36.0 digest_0.6.31
[9] ensembldb_2.22.0 htmltools_0.5.4 GO.db_3.16.0 fansi_1.0.4
[13] magrittr_2.0.3 checkmate_2.1.0 memoise_2.0.1 cluster_2.1.4
[17] tzdb_0.3.0 annotate_1.76.0 matrixStats_0.63.0 R.utils_2.12.2
[21] timechange_0.2.0 prettyunits_1.1.1 jpeg_0.1-10 colorspace_2.1-0
[25] blob_1.2.3 rappdirs_0.3.3 xfun_0.37 crayon_1.5.2
[29] RCurl_1.98-1.10 TFMPvalue_0.0.9 VariantAnnotation_1.44.1 glue_1.6.2
[33] gtable_0.3.1 zlibbioc_1.44.0 DelayedArray_0.24.0 scales_1.2.1
[37] DBI_1.1.3 Rcpp_1.0.10 xtable_1.8-4 progress_1.2.2
[41] htmlTable_2.4.1 foreign_0.8-84 bit_4.0.5 Formula_1.2-5
[45] htmlwidgets_1.6.1 httr_1.4.5 RColorBrewer_1.1-3 ellipsis_0.3.2
[49] R.methodsS3_1.8.2 pkgconfig_2.0.3 XML_3.99-0.13 Gviz_1.42.1
[53] nnet_7.3-18 dbplyr_2.3.1 deldir_1.0-6 utf8_1.2.3
[57] tidyselect_1.2.0 rlang_1.0.6 reshape2_1.4.4 AnnotationDbi_1.60.2
[61] munsell_0.5.0 tools_4.2.2 cachem_1.0.7 cli_3.6.0
[65] DirichletMultinomial_1.40.0 generics_0.1.3 RSQLite_2.3.0 ade4_1.7-22
[69] evaluate_0.20 fastmap_1.1.1 yaml_2.3.7 knitr_1.42
[73] bit64_4.0.5 caTools_1.18.2 KEGGREST_1.38.0 AnnotationFilter_1.22.0
[77] splitstackshape_1.4.8 R.oo_1.25.0 poweRlaw_0.70.6 pracma_2.4.2
[81] xml2_1.3.3 biomaRt_2.54.0 compiler_4.2.2 rstudioapi_0.14
[85] filelock_1.0.2 curl_5.0.0 png_0.1-8 stringi_1.7.12
[89] GenomicFeatures_1.50.4 lattice_0.20-45 ProtGenerics_1.30.0 CNEr_1.34.0
[93] Matrix_1.5-3 vctrs_0.5.2 pillar_1.8.1 lifecycle_1.0.3
[97] data.table_1.14.8 bitops_1.0-7 R6_2.5.1 BiocIO_1.8.0
[101] latticeExtra_0.6-30 gridExtra_2.3 motifStack_1.42.0 codetools_0.2-19
[105] dichromat_2.0-0.1 MASS_7.3-58.3 gtools_3.9.4 seqLogo_1.64.0
[109] SummarizedExperiment_1.28.0 rjson_0.2.21 withr_2.5.0 GenomicAlignments_1.34.1
[113] Rsamtools_2.14.0 GenomeInfoDbData_1.2.9 parallel_4.2.2 hms_1.1.2
[117] rpart_4.1.19 rmarkdown_2.20 MatrixGenerics_1.10.0 biovizBase_1.46.0
[121] Biobase_2.58.0 base64enc_0.1-3 interp_1.1-3 restfulr_0.0.15
Any thoughts @Simon-Coetzee ? Happy to test it if you have some updates soon, or do some extra sleuthing if that'll be helpful, otherwise I will go ahead and use the current version. Appreciate your work!
I think I've figured it out. The code as it exists relies upon the motif rows being in the specific order A, C, G, T, and your example matrix was in the form A, C, T, G. MotifDb has it this way (alphabetical) and it allows me to just flip the pwm matrix to do the reverse complement. However, this is not enforced or referenced anywhere, and so in cases like this it produces unexpected outcomes. I'll see what I can do it check, or enforce, or create this ordering in all pwms.
Checked it, you're completely right. Really my fault to be honest, I have thought about the order of the matrix and somehow got it wrong anyway. Sorry for using up your time, it's certainly nothing that will affect most of the users and the software is not really advertised as 'plug in your own matrix'.
I guess the only question that remains is whether inversion of a motif (or for that matter shift in position) should be treated as 'neutral' mutation if it results in the same or v similar match score (as it is now). Some motifs are pretty much palindromic, but others are very directional, e.g. CTCF, and their inversion could definitely have expression consequences. Perhaps outputting altMotifPos with indication of strand could be a feature in the future, that would allow the user to investigate that if desired without creating a backcompatibility issue (just an extra column).
Otherwise, this issue is resolved for me, feel free to close as you see fit!
pctAlt should simply be the min-max scaled binding score. I've ran a simple test whose pctAlt results are different from expected. This makes me wonder how exacly motifbreakR finds the motif in mutated/alt sequence.
I've made this fake perfect binding matrix (CACCTG/rc:CAGGTG):
And I've run motifbreakR on two different sets of block mutations that affect exactly half of the perfect motif (3/6 nt).
As far as I understand, motifBreakeR performs a motif scan of both strands of the sequence within the limits determined by the variants. If so, then I think pctAlt results should be 4/6, 4/6, 5/6, 4/6 (see below), but the motifbreakR says 5/6, 4/6, 3/6 and 4/6 (0.833, 0.666, 0.5, 0.666). In other words, sometimes I can find a better match, sometimes I can't find the match it seems to find.
None of this is affected by scoring method in this simple matrix instance (I checked log and ic), reference matches and motifPos are perfect.
Specifically, I'm wondering:
Sorry if this is an exhaustive question and it's fair to say 'I don't have time to develop for this specific problem'. It may also be that (somehow) solving #49 will solve this too.
See details of matches below: