Sequence indcies out of bound when using matchPattern.

Bioconductor / Biostrings

Efficient manipulation of biological strings

54 stars 16 forks source link

I have noticed a problem wrt the iranges output when searching a pattern in a dna sequence with 1 mismatch allowed: See example: " dna_string <- "TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG" matchPattern(pattern = "TTAGGG", subject = dna_string, max.mismatch = 1)

output: Views on a 52-letter BString subject subject: TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG views: start end width [1] 0 5 6 [ TAGGG] [2] 6 11 6 [TTAGGG] [3] 12 17 6 [TTAGGG] [4] 18 23 6 [TTAGGG] [5] 24 29 6 [TTAGGG] [6] 30 35 6 [TTAGGG] [7] 36 41 6 [TTAGGG] [8] 42 47 6 [TTAGGG] [9] 48 53 6 [TTAGG ]

" You can see that the pattern at the begining of the sequence missing a Tat the start of the sequence and the pattern at the end is missing the last 'G'. In my opinion the first irange should be [1,5] and the last one should be [48, 52] or the result should be without the first and last results since the run was with no indels and the iranges are out of bound wrt the seqeunce length. Am I missing something with the definition of the results ?

I would appreciate your response, Dan.

This is a feature. The matchPattern/vmatchPattern family of functions treat out-of-bound nucleotide positions as mismatches (think of these positions as having letters that don't match any letter in DNA_ALPHABET).

Note that the out-of-bound matches are easy to remove:

library(Biostrings)

matches <- matchPattern("ATGG", subject, max.mismatch=1)
matches
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
#       start end width
#   [1]     2   5     4 [ATGC]
#   [2]     8  11     4 [GTGG]
#   [3]    14  17     4 [ATG ]

## Remove the out-of-bound matches:
keep_me <- (0 <= start(matches)) & (end(matches) <= length(subject(matches)))
matches[keep_me]
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
#       start end width
#   [1]     2   5     4 [ATGC]
#   [2]     8  11     4 [GTGG]

Alternatively, you can trim the out-of-bound matches:

trim(matches)
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
#       start end width
#   [1]     2   5     4 [ATGC]
#   [2]     8  11     4 [GTGG]
#   [3]    14  16     3 [ATG]

Hope this helps, H.

Bioconductor / Biostrings

Sequence indcies out of bound when using matchPattern. #102