Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

Is there any way to extract a sequence between two other sequences and allow for mismatches #56

loukesio closed this issue 2 years ago

loukesio commented 3 years ago

First of all, thank you for maintaining and developing Biostrings; you make our lives better every day. The following might not be an issue; it is mostly a request for help. Please accept my apologies in advance for taking advantage. I have a string that looks like this:

dna_string <- DNAString("AAAAANNNNNNNNNNNNNNNNNNNNNNNNNCCCCC")

I want to extract the sequence between a left pattern (AAAAA) and a right pattern (CCCCC), allowing for mismatches in both patterns. I would like to set the minimum distance between the left and right patterns to 6 and the maximum to 30.

Do you have any idea if this is possible? I am aware of the super cool matchLRPatterns():

matchLRPatterns("AAAAA", "CCCCC", 25, dna_string)

but here I can only set the maximum distance, not the minimum.

hpages commented 2 years ago

This issue is very old! I completely missed it, sorry.

How about filtering the matches returned by matchLRPatterns() to keep only those that have a minimum width? Something like this:

> library(Biostrings)
> dna_string <- DNAString("GTGTGTAAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCCAGAG")
> matches <- matchLRPatterns("AAAA", "CCCCC", 35, dna_string)

> matches
Views on a 47-letter DNAString subject
subject: GTGTGTAAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCCAGAG
views:
      start end width
  [1]     7  43    37 [AAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCC]
  [2]     8  43    36 [AAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCC]
  [3]    28  43    16 [AAAANNNNNNNCCCCC]

> matches[width(matches) >= 20]
Views on a 47-letter DNAString subject
subject: GTGTGTAAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCCAGAG
views:
      start end width
  [1]     7  43    37 [AAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCC]
  [2]     8  43    36 [AAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCC]
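
Note that width(matches) counts the two flanking patterns as well as the gap between them, so a minimum distance between the patterns corresponds to a filter on the width minus the two flank lengths. Here is a minimal sketch of that idea, assuming no indels are allowed (so each matched flank has the same length as its pattern). The variable names min_gap, max_gap, gap_len, kept and between are only illustrative. Mismatches in the flanks can be allowed through the max.Lmismatch and max.Rmismatch arguments of matchLRPatterns(); they are left at 0 here.

```r
library(Biostrings)

dna_string <- DNAString("GTGTGTAAAAANNNNNGTGTNNNNNNNAAAANNNNNNNCCCCCAGAG")
Lpattern <- "AAAA"
Rpattern <- "CCCCC"
min_gap <- 6   # minimum distance between the two flanks (from the question)
max_gap <- 30  # maximum distance between the two flanks (from the question)

# max.gaplength enforces the upper bound on the gap.
# Raise max.Lmismatch / max.Rmismatch above 0 to tolerate mismatches in a flank.
matches <- matchLRPatterns(Lpattern, Rpattern, max.gaplength = max_gap,
                           subject = dna_string,
                           max.Lmismatch = 0, max.Rmismatch = 0)

# Gap length = total match width minus the lengths of the two flanks
# (valid as long as indels are not allowed in the flanks).
gap_len <- width(matches) - nchar(Lpattern) - nchar(Rpattern)

# Enforce the lower bound on the gap by subsetting the views.
kept <- matches[gap_len >= min_gap]

# Views on the sequence strictly between the two flanks.
between <- Views(dna_string,
                 start = start(kept) + nchar(Lpattern),
                 end   = end(kept) - nchar(Rpattern))
```

as.character(between) then returns the in-between sequences as an ordinary character vector. If with.Lindels or with.Rindels were set to TRUE, the matched flanks could differ in length from the patterns, and this width arithmetic would no longer be exact.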

This is really a question about basic usage of the package. Note that such questions are better asked on the Bioconductor support site at https://support.bioconductor.org, where they get a lot more exposure and are more likely to get quick attention.

Best, H.