CicciaLab / iSTOP

2 stars 3 forks source link

`locate_codons()` does not find all expected codons #8

Closed EricEdwardBryant closed 3 years ago

EricEdwardBryant commented 3 years ago

There is an issue with iSTOP::locate_codons(). Some expected codons are not returned. See example below:

# Given a reference genome and a set of CDS coordinates
hg38 <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens

CDS <-
  read_csv(
    "tx,gene,exon,chr,strand,start,end
     ENST00000379143.10,PCNA,1,chr20,-,5119578,5119798
     ENST00000379143.10,PCNA,2,chr20,-,5118769,5118866
     ENST00000379143.10,PCNA,3,chr20,-,5118610,5118677
     ENST00000379143.10,PCNA,4,chr20,-,5117470,5117664
     ENST00000379143.10,PCNA,5,chr20,-,5115449,5115572
     ENST00000379143.10,PCNA,6,chr20,-,5115283,5115362",
    col_types = "cciccii"
  )

# Attempt to look up an AAA codon
bad_result <-
  iSTOP::locate_codons(
    CDS,
    hg38,
    codons = "AAA",
    positions = 1L,
    switch_strand = TRUE
  )

# Manually get the coding sequence for this transcript
cds_dna <-
  BSgenome::getSeq(hg38, as(select(CDS, chr, strand, start, end), "GRanges")) %>%
  as.character() %>%
  str_c(collapse = "")

# Manually split into codons
codon_starts <- seq(1, str_length(cds_dna), by = 3L)
codon_ends   <- codon_starts + 2L
codons <- str_sub(cds_dna, start = codon_starts, end = codon_ends)

# We know that there should be "AAA" at these positions
codons[c(80, 164, 217, 248)]

# But these are not returned by iSTOP::locate_codons() in `bad_result`
setdiff(which(codons == "AAA"), bad_result$aa_coord)
CicciaLab commented 3 years ago

Fixed in #9

Issue was caused by use of stringr::str_locate_all() to locate codons in a coding sequence string (CDS).

The original implementation assumed that the following would find three locations of the "AAA" codon, when in fact it will only find one location due to regex consuming the matching portion of the string.

stringr::str_locate_all("TAAAAAT", "AAA")
# [[1]]
#      start end
# [1,]     2   4

For regex to match all instances of an overlapping pattern, the pattern would need to have zero length like so:

stringr::str_locate_all("TAAAAAT", "(?=AAA)")
# [[1]]
#      start end
# [1,]     2   1
# [2,]     3   2
# [3,]     4   3

Instead of using regex, I opted for a simpler approach that splits the CDS string into a vector of codons. This makes for easier to read code that will not fall prey to regex "gotcha"s.

Relevant quote here

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.