`locate_codons()` does not find all expected codons

# Given a reference genome and a set of CDS coordinates hg38 <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens CDS <- read_csv( "tx,gene,exon,chr,strand,start,end ENST00000379143.10,PCNA,1,chr20,-,5119578,5119798 ENST00000379143.10,PCNA,2,chr20,-,5118769,5118866 ENST00000379143.10,PCNA,3,chr20,-,5118610,5118677 ENST00000379143.10,PCNA,4,chr20,-,5117470,5117664 ENST00000379143.10,PCNA,5,chr20,-,5115449,5115572 ENST00000379143.10,PCNA,6,chr20,-,5115283,5115362", col_types = "cciccii" ) # Attempt to look up an AAA codon bad_result <- iSTOP::locate_codons( CDS, hg38, codons = "AAA", positions = 1L, switch_strand = TRUE ) # Manually get the coding sequence for this transcript cds_dna <- BSgenome::getSeq(hg38, as(select(CDS, chr, strand, start, end), "GRanges")) %>% as.character() %>% str_c(collapse = "") # Manually split into codons codon_starts <- seq(1, str_length(cds_dna), by = 3L) codon_ends <- codon_starts + 2L codons <- str_sub(cds_dna, start = codon_starts, end = codon_ends) # We know that there should be "AAA" at these positions codons[c(80, 164, 217, 248)] # But these are not returned by iSTOP::locate_codons() in `bad_result` setdiff(which(codons == "AAA"), bad_result$aa_coord)

Fixed in #9

Issue was caused by use of stringr::str_locate_all() to locate codons in a coding sequence string (CDS).

The original implementation assumed that the following would find three locations of the "AAA" codon, when in fact it will only find one location due to regex consuming the matching portion of the string.

stringr::str_locate_all("TAAAAAT", "AAA")
# [[1]]
#      start end
# [1,]     2   4

For regex to match all instances of an overlapping pattern, the pattern would need to have zero length like so:

stringr::str_locate_all("TAAAAAT", "(?=AAA)")
# [[1]]
#      start end
# [1,]     2   1
# [2,]     3   2
# [3,]     4   3

Instead of using regex, I opted for a simpler approach that splits the CDS string into a vector of codons. This makes for easier to read code that will not fall prey to regex "gotcha"s.

Relevant quote here

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

CicciaLab / iSTOP

`locate_codons()` does not find all expected codons #8