Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

vmatchpattern() with max.mismatch= 1 gives less results than with max.mismatch =0 #69

Closed LN-rich closed 2 years ago

LN-rich commented 2 years ago

I googled it and found another incidence describing the same kind of problem I have: https://github.com/benjjneb/dada2/issues/1276 I am using the NestLink twoPatternReadFilter() function and as stated, I get more results when I have max.mismatches =0 than with any other number. Even with a max.mismatch as long as the pattern I get less results. Intrestingly the same amount as with max.mismatch =1 . Is it possible that the choice of the alternative algorithm compared to the one chosen with max.mismatch =0 raises this error?

hpages commented 2 years ago

Can you please provide a reproducible example? Thanks

hpages commented 2 years ago

Are you going to follow up on this? We can't help without a reproducible example.

LN-rich commented 2 years ago

I‘d love to but unfortunately I can’t without sharing unpublished data, so I cannot provide a reproducible example. I’ll let You if I manage to debug myself.

On 8. Sep 2022, at 19:08, Hervé Pagès @.***> wrote:



Are you going to follow up on this? We can't help without a reproducible example.

— Reply to this email directly, view it on GitHubhttps://github.com/Bioconductor/Biostrings/issues/69#issuecomment-1240988146, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AUHZQR7W7JJZSQNJTSTFJE3V5IMQBANCNFSM5YAUFSAQ. You are receiving this because you authored the thread.Message ID: @.***>

hpages commented 2 years ago

Well, here's one for you, and it has nothing to do with "vmatchpattern() with max.mismatch= 1 gives less results than with max.mismatch=0" (you have no evidence of that):

library(NestLink)
reads <- DNAStringSet(c("GGACTGGGTTTTT", "ACTGGGGGACTGTTTTT","ACCCTGGGTTT"))

twoPatternReadFilter(reads, "GGACTG", "TTT", maxMismatch=0)
# $reads
# DNAStringSet object of length 2:
#     width seq
# [1]    13 GGACTGGGTTTTT
# [2]    17 ACTGGGGGACTGTTTTT
# 
# $patternPositions
#   leftStart1 leftEnd1 rightStart2 rightEnd2
# 1          1        6           9        11
# 2          7       12          13        15

twoPatternReadFilter(reads, "GGACTG", "TTT", maxMismatch=2)
# $reads
# DNAStringSet object of length 0
# 
# $patternPositions
# [1] leftStart1  leftEnd1    rightStart2 rightEnd2  
# <0 rows> (or 0-length row.names)

So it looks like NestLink::twoPatternReadFilter() uses some questionable logic to keep or drop reads. That's something you would need to discuss with the NestLink folks.

sessionInfo() ``` R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS Matrix products: default BLAS: /home/hpages/R/R-4.2.0/lib/libRblas.so LAPACK: /home/hpages/R/R-4.2.0/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] NestLink_1.12.0 ShortRead_1.54.0 [3] GenomicAlignments_1.32.1 SummarizedExperiment_1.26.1 [5] Biobase_2.56.0 MatrixGenerics_1.8.1 [7] matrixStats_0.62.0 Rsamtools_2.12.0 [9] GenomicRanges_1.48.0 BiocParallel_1.30.3 [11] protViz_0.7.3 gplots_3.1.3 [13] Biostrings_2.64.1 GenomeInfoDb_1.32.4 [15] XVector_0.36.0 IRanges_2.30.1 [17] S4Vectors_0.34.0 ExperimentHub_2.4.0 [19] AnnotationHub_3.4.0 BiocFileCache_2.4.0 [21] dbplyr_2.2.1 BiocGenerics_0.42.0 loaded via a namespace (and not attached): [1] httr_1.4.4 bit64_4.0.5 [3] gtools_3.9.3 shiny_1.7.2 [5] assertthat_0.2.1 interactiveDisplayBase_1.34.0 [7] BiocManager_1.30.18 latticeExtra_0.6-30 [9] blob_1.2.3 GenomeInfoDbData_1.2.8 [11] yaml_2.3.5 BiocVersion_3.15.2 [13] pillar_1.8.1 RSQLite_2.2.17 [15] lattice_0.20-45 glue_1.6.2 [17] digest_0.6.29 RColorBrewer_1.1-3 [19] promises_1.2.0.1 htmltools_0.5.3 [21] httpuv_1.6.6 Matrix_1.5-1 [23] pkgconfig_2.0.3 zlibbioc_1.42.0 [25] purrr_0.3.4 xtable_1.8-4 [27] jpeg_0.1-9 later_1.3.0 [29] tibble_3.1.8 KEGGREST_1.36.3 [31] generics_0.1.3 ellipsis_0.3.2 [33] cachem_1.0.6 cli_3.4.0 [35] deldir_1.0-6 magrittr_2.0.3 [37] crayon_1.5.1 mime_0.12 [39] memoise_2.0.1 fansi_1.0.3 [41] hwriter_1.3.2.1 tools_4.2.0 [43] lifecycle_1.0.2 interp_1.1-3 [45] DelayedArray_0.22.0 AnnotationDbi_1.58.0 [47] compiler_4.2.0 caTools_1.18.2 [49] rlang_1.0.5 grid_4.2.0 [51] RCurl_1.98-1.8 rappdirs_0.3.3 [53] bitops_1.0-7 codetools_0.2-18 [55] DBI_1.1.3 curl_4.3.2 [57] R6_2.5.1 dplyr_1.0.10 [59] fastmap_1.1.0 bit_4.0.4 [61] utf8_1.2.2 filelock_1.0.2 [63] KernSmooth_2.23-20 parallel_4.2.0 [65] Rcpp_1.0.9 vctrs_0.4.1 [67] png_0.1-7 tidyselect_1.1.2 ```