Simon-Coetzee / motifBreakR

A Package For Predicting The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites.
27 stars 12 forks source link

unexpected pctAlt #50

Closed lotard closed 3 months ago

lotard commented 1 year ago

pctAlt should simply be the min-max scaled binding score. I've ran a simple test whose pctAlt results are different from expected. This makes me wonder how exacly motifbreakR finds the motif in mutated/alt sequence.

I've made this fake perfect binding matrix (CACCTG/rc:CAGGTG): image

And I've run motifbreakR on two different sets of block mutations that affect exactly half of the perfect motif (3/6 nt). image image

As far as I understand, motifBreakeR performs a motif scan of both strands of the sequence within the limits determined by the variants. If so, then I think pctAlt results should be 4/6, 4/6, 5/6, 4/6 (see below), but the motifbreakR says 5/6, 4/6, 3/6 and 4/6 (0.833, 0.666, 0.5, 0.666). In other words, sometimes I can find a better match, sometimes I can't find the match it seems to find.

None of this is affected by scoring method in this simple matrix instance (I checked log and ic), reference matches and motifPos are perfect.

Specifically, I'm wondering:

Sorry if this is an exhaustive question and it's fair to say 'I don't have time to develop for this specific problem'. It may also be that (somehow) solving #49 will solve this too.

See details of matches below:

image
lotard commented 1 year ago

I refined it further - if I use the perfect matrix on a variant which inverts the binding site (pinv here) I get pctAlt 4/6. The same if inversion has one mismatch (ipinv). This is running motifbreakR with no filters (filterp=F, threshold=0,show.neutral=T)

image

So even though a negative strand scan is performed (scoreSeqWindows function uses both the motif and reverse complement of the motif) it somehow doesn't get reported?

In any case, what I'd expect is:

Simon-Coetzee commented 1 year ago

Could you please share the objects that you are using, as an .rda file, and I'll load them up and see where the problem is arising in the code.

lotard commented 1 year ago

Thanks for looking into it. Here's a reprex R code and .rda of it for the second case [inversion] (will say it failed to load MotifDb due to timelimit, but loads objects up fine). reprex_pinv.zip

lotard commented 1 year ago

Here's reprex and .rda for the first case (blocks), except I used 100% perfect matrix (instead of 97%), so it's same in both. Doesn't really affect results anyway. reprex_blocks.zip

lotard commented 1 year ago
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3                   rtracklayer_1.58.0               
 [4] motifbreakR_2.12.3                MotifDb_1.40.0                    Biostrings_2.66.0                
 [7] XVector_0.38.0                    GenomicRanges_1.50.2              GenomeInfoDb_1.34.9              
[10] IRanges_2.32.0                    S4Vectors_0.36.2                  BiocGenerics_0.44.0              
[13] lubridate_1.9.2                   forcats_1.0.0                     stringr_1.5.0                    
[16] dplyr_1.1.0                       purrr_1.0.1                       readr_2.1.4                      
[19] tidyr_1.3.0                       tibble_3.2.0                      ggplot2_3.4.1                    
[22] tidyverse_2.0.0                  

loaded via a namespace (and not attached):
  [1] backports_1.4.1             Hmisc_5.0-1                 BiocFileCache_2.6.1         plyr_1.8.8                 
  [5] lazyeval_0.2.2              BiocParallel_1.32.5         TFBSTools_1.36.0            digest_0.6.31              
  [9] ensembldb_2.22.0            htmltools_0.5.4             GO.db_3.16.0                fansi_1.0.4                
 [13] magrittr_2.0.3              checkmate_2.1.0             memoise_2.0.1               cluster_2.1.4              
 [17] tzdb_0.3.0                  annotate_1.76.0             matrixStats_0.63.0          R.utils_2.12.2             
 [21] timechange_0.2.0            prettyunits_1.1.1           jpeg_0.1-10                 colorspace_2.1-0           
 [25] blob_1.2.3                  rappdirs_0.3.3              xfun_0.37                   crayon_1.5.2               
 [29] RCurl_1.98-1.10             TFMPvalue_0.0.9             VariantAnnotation_1.44.1    glue_1.6.2                 
 [33] gtable_0.3.1                zlibbioc_1.44.0             DelayedArray_0.24.0         scales_1.2.1               
 [37] DBI_1.1.3                   Rcpp_1.0.10                 xtable_1.8-4                progress_1.2.2             
 [41] htmlTable_2.4.1             foreign_0.8-84              bit_4.0.5                   Formula_1.2-5              
 [45] htmlwidgets_1.6.1           httr_1.4.5                  RColorBrewer_1.1-3          ellipsis_0.3.2             
 [49] R.methodsS3_1.8.2           pkgconfig_2.0.3             XML_3.99-0.13               Gviz_1.42.1                
 [53] nnet_7.3-18                 dbplyr_2.3.1                deldir_1.0-6                utf8_1.2.3                 
 [57] tidyselect_1.2.0            rlang_1.0.6                 reshape2_1.4.4              AnnotationDbi_1.60.2       
 [61] munsell_0.5.0               tools_4.2.2                 cachem_1.0.7                cli_3.6.0                  
 [65] DirichletMultinomial_1.40.0 generics_0.1.3              RSQLite_2.3.0               ade4_1.7-22                
 [69] evaluate_0.20               fastmap_1.1.1               yaml_2.3.7                  knitr_1.42                 
 [73] bit64_4.0.5                 caTools_1.18.2              KEGGREST_1.38.0             AnnotationFilter_1.22.0    
 [77] splitstackshape_1.4.8       R.oo_1.25.0                 poweRlaw_0.70.6             pracma_2.4.2               
 [81] xml2_1.3.3                  biomaRt_2.54.0              compiler_4.2.2              rstudioapi_0.14            
 [85] filelock_1.0.2              curl_5.0.0                  png_0.1-8                   stringi_1.7.12             
 [89] GenomicFeatures_1.50.4      lattice_0.20-45             ProtGenerics_1.30.0         CNEr_1.34.0                
 [93] Matrix_1.5-3                vctrs_0.5.2                 pillar_1.8.1                lifecycle_1.0.3            
 [97] data.table_1.14.8           bitops_1.0-7                R6_2.5.1                    BiocIO_1.8.0               
[101] latticeExtra_0.6-30         gridExtra_2.3               motifStack_1.42.0           codetools_0.2-19           
[105] dichromat_2.0-0.1           MASS_7.3-58.3               gtools_3.9.4                seqLogo_1.64.0             
[109] SummarizedExperiment_1.28.0 rjson_0.2.21                withr_2.5.0                 GenomicAlignments_1.34.1   
[113] Rsamtools_2.14.0            GenomeInfoDbData_1.2.9      parallel_4.2.2              hms_1.1.2                  
[117] rpart_4.1.19                rmarkdown_2.20              MatrixGenerics_1.10.0       biovizBase_1.46.0          
[121] Biobase_2.58.0              base64enc_0.1-3             interp_1.1-3                restfulr_0.0.15
lotard commented 1 year ago

Any thoughts @Simon-Coetzee ? Happy to test it if you have some updates soon, or do some extra sleuthing if that'll be helpful, otherwise I will go ahead and use the current version. Appreciate your work!

Simon-Coetzee commented 1 year ago

I think I've figured it out. The code as it exists relies upon the motif rows being in the specific order A, C, G, T, and your example matrix was in the form A, C, T, G. MotifDb has it this way (alphabetical) and it allows me to just flip the pwm matrix to do the reverse complement. However, this is not enforced or referenced anywhere, and so in cases like this it produces unexpected outcomes. I'll see what I can do it check, or enforce, or create this ordering in all pwms.

lotard commented 1 year ago

Checked it, you're completely right. Really my fault to be honest, I have thought about the order of the matrix and somehow got it wrong anyway. Sorry for using up your time, it's certainly nothing that will affect most of the users and the software is not really advertised as 'plug in your own matrix'.

I guess the only question that remains is whether inversion of a motif (or for that matter shift in position) should be treated as 'neutral' mutation if it results in the same or v similar match score (as it is now). Some motifs are pretty much palindromic, but others are very directional, e.g. CTCF, and their inversion could definitely have expression consequences. Perhaps outputting altMotifPos with indication of strand could be a feature in the future, that would allow the user to investigate that if desired without creating a backcompatibility issue (just an extra column).

Otherwise, this issue is resolved for me, feel free to close as you see fit!