Closed EricEdwardBryant closed 3 years ago
Fixed in #9
The issue was caused by the use of stringr::str_locate_all() to locate codons in a coding sequence (CDS) string.
The original implementation assumed that the following call would find three locations of the "AAA" codon (starting at positions 2, 3, and 4). In fact it finds only one, because the regex engine consumes the matched characters and resumes searching after them, so matches cannot overlap.
stringr::str_locate_all("TAAAAAT", "AAA")
# [[1]]
#      start end
# [1,]     2   4
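For regex to match every instance of an overlapping pattern, the pattern itself must match zero characters, for example by wrapping it in a lookahead. Each zero-length match is then reported with an end one less than its start: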
stringr::str_locate_all("TAAAAAT", "(?=AAA)")
# [[1]]
#      start end
# [1,]     2   1
# [2,]     3   2
# [3,]     4   3
Instead of using regex, I opted for a simpler approach that splits the CDS string into a vector of codons. This makes for easier-to-read code that will not fall prey to regex "gotchas".
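For illustration, the codon-splitting idea might look something like the sketch below. This is not the code from the fix itself: split_codons() and locate_codon_starts() are hypothetical helper names, and the sketch assumes the CDS is in frame starting at position 1, with any trailing partial codon ignored.

```r
# Hypothetical helper: split a CDS string into in-frame codons.
# Assumes the reading frame starts at position 1; trailing bases that do
# not form a complete codon are dropped.
split_codons <- function(cds) {
  n <- nchar(cds) %/% 3
  if (n == 0) return(character(0))
  starts <- seq(1, by = 3, length.out = n)
  substring(cds, starts, starts + 2)
}

# Hypothetical helper: nucleotide start positions of every in-frame
# occurrence of `codon` in `cds`.
locate_codon_starts <- function(cds, codon) {
  codons <- split_codons(cds)
  (which(codons == codon) - 1L) * 3L + 1L
}

split_codons("ATGAAAAAATAA")
# [1] "ATG" "AAA" "AAA" "TAA"

locate_codon_starts("ATGAAAAAATAA", "AAA")
# [1] 4 7
```

Because each codon is compared as a whole element of the vector, adjacent identical codons are both found, and out-of-frame occurrences are never reported.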
Relevant quote:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
There is an issue with iSTOP::locate_codons(). Some expected codons are not returned. See example below: