Closed Dan-Lt closed 5 months ago
This is a feature. The matchPattern
/vmatchPattern
family of functions treat out-of-bound nucleotide positions as mismatches (think of these positions as having letters that don't match any letter in DNA_ALPHABET
).
Note that the out-of-bound matches are easy to remove:
library(Biostrings)
matches <- matchPattern("ATGG", subject, max.mismatch=1)
matches
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
# start end width
# [1] 2 5 4 [ATGC]
# [2] 8 11 4 [GTGG]
# [3] 14 17 4 [ATG ]
## Remove the out-of-bound matches:
keep_me <- (0 <= start(matches)) & (end(matches) <= length(subject(matches)))
matches[keep_me]
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
# start end width
# [1] 2 5 4 [ATGC]
# [2] 8 11 4 [GTGG]
Alternatively, you can trim the out-of-bound matches:
trim(matches)
# Views on a 16-letter DNAString subject
# subject: AATGCGCGTGGATATG
# views:
# start end width
# [1] 2 5 4 [ATGC]
# [2] 8 11 4 [GTGG]
# [3] 14 16 3 [ATG]
Hope this helps, H.
I have noticed a problem wrt the iranges output when searching a pattern in a dna sequence with 1 mismatch allowed: See example: " dna_string <- "TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG" matchPattern(pattern = "TTAGGG", subject = dna_string, max.mismatch = 1)
output: Views on a 52-letter BString subject subject: TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG views: start end width [1] 0 5 6 [ TAGGG] [2] 6 11 6 [TTAGGG] [3] 12 17 6 [TTAGGG] [4] 18 23 6 [TTAGGG] [5] 24 29 6 [TTAGGG] [6] 30 35 6 [TTAGGG] [7] 36 41 6 [TTAGGG] [8] 42 47 6 [TTAGGG] [9] 48 53 6 [TTAGG ]
" You can see that the pattern at the begining of the sequence missing a Tat the start of the sequence and the pattern at the end is missing the last 'G'. In my opinion the first irange should be [1,5] and the last one should be [48, 52] or the result should be without the first and last results since the run was with no indels and the iranges are out of bound wrt the seqeunce length. Am I missing something with the definition of the results ?
I would appreciate your response, Dan.