f-hamidlab / factR2

Functional Annotation of Custom Transcriptomes v2
https://f-hamidlab.github.io/factR2/
Apache License 2.0
3 stars 1 forks source link

Should every skipped exon be called a cassette exon? #4

Closed isaacvock closed 6 months ago

isaacvock commented 9 months ago

Hello,

First off, thank you for this amazing tool! It has already proven incredibly useful in my research and is very well documented and easy to use.

The short version of my question: Would it be possible for factR2 to distinguish cassette exons that are only included in a single annotated isoform from those which are included in multiple isoforms and may have alternative 5' and 3' splice sites in some of these isoforms?

Longer version of my question: I dug into the source code for the .runAS() function called to assign splicing event type names to alternative exons, and was a bit surprised by how intronic regions are defined and used to annotate splicing events. To illustrate what I am referring to, I crafted a simple test example using a theoretical annotation for a single gene:

Test_Annotation

Code to annotate splice events (basically just .runAS with "test" appended to end of object names):

# Simple test data based on theoretical transcript structure
testdata <- data.frame(start = c(100, 1500, 3000, 
                     100, 1500, 3000,
                     100, 3200), 
           end = c(500, 2000, 3500, 
                   500, 2200, 3500,
                   500, 3500),
           transcript_id = c("A", "A", "A", 
                             "B", "B", "B", 
                             "C", "C"),
           seqnames = "chr1",
           gene_id = "A",
           gene_name = "A")

# Process exon coordinates
xtest <- testdata %>% 
  group_by(transcript_id) %>% 
  dplyr::arrange(start) %>% 
  mutate(pos = dplyr::row_number()) %>% 
  mutate(pos = ifelse(pos == 1, "First", pos)) %>% 
  mutate(pos = ifelse(pos == dplyr::n(), "Last", pos)) %>%
  dplyr::group_by(gene_id) %>%
  dplyr::arrange(start, end) %>%
  dplyr::mutate(termini = dplyr::row_number()) %>%
  dplyr::mutate(termini = ifelse(termini == 1 | termini == dplyr::n(), T, F)) %>%
  dplyr::group_by(seqnames, end, gene_id) %>%
  dplyr::mutate(start = ifelse(pos == "First", start[dplyr::n()], start)) %>%
  dplyr::group_by(seqnames, start, gene_id) %>%
  dplyr::mutate(end = ifelse(pos == "Last", end[1], end)) %>%
  GenomicRanges::makeGRangesFromDataFrame(keep.extra.columns = T)

# Map genes to transcripts
t2gtest <- xtest %>%
  as.data.frame() %>%
  dplyr::select(gene_id, transcript_id) %>%
  dplyr::distinct()

# Get exonic and intronic regions
exonsbytxtest <- S4Vectors::split(xtest, ~transcript_id)
intronsbytxtest <- GenomicRanges::psetdiff(BiocGenerics::unlist(range(exonsbytxtest)), exonsbytxtest)

# Get a gene's total intronic region
intronsbygrouptest <- t2gtest %>%
  dplyr::left_join(as.data.frame(intronsbytxtest), by = c("transcript_id" = "group_name")) %>%
  dplyr::filter(!is.na(seqnames)) %>%
  GenomicRanges::makeGRangesListFromDataFrame(split.field = "gene_id") %>%
  GenomicRanges::reduce() %>%
  as.data.frame() %>%
  GenomicRanges::makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# Find alternative exons
altexonstest <- IRanges::findOverlapPairs(xtest, intronsbygrouptest)

# Annotate splicing events
altannotatetest <- altexonstest %>%
  as.data.frame() %>%
  dplyr::mutate(AStype = "CE") %>%
  dplyr::mutate(AStype = ifelse(first.X.start < second.X.start & first.X.end < second.X.end, "Ad", AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start > second.X.start & first.X.end > second.X.end, "Aa", AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start < second.X.start & first.X.end <= second.X.end & first.pos == "Last", NA, AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start >= second.X.start & first.X.end > second.X.end & first.pos == "First", NA, AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start < second.X.start & first.X.end > second.X.end, "RI", AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start >= second.X.start & first.X.end < second.X.end & first.pos == "First", "Af", AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.start > second.X.start & first.X.end <= second.X.end & first.pos == "Last", "Al", AStype)) %>%
  dplyr::mutate(AStype = ifelse(first.X.strand == "-", chartr("adfl", "dalf", AStype), AStype)) %>%
  dplyr::mutate(same.group = ifelse(first.gene_id == second.group_name,TRUE,FALSE))

altannotatetest

# get segments that fall within intron and annotate that segment
altexonstest <- GenomicRanges::pintersect(altexonstest)
if (length(altexonstest) == 0) {
  altexonstest$pos <- altexonstest$hit <- altexonstest$termini <- altexonstest$grouping<- NULL

} else {
  altexonstest$type <- "AS"
  altexonstest$AStype <- toupper(altannotatetest$AStype)
  altexonstest <- altexonstest[!is.na(altexonstest$AStype) & altannotatetest$same.group]
  altexonstest$pos <- altexonstest$hit <- altexonstest$termini <-altexonstest$grouping <- NULL

  # add ID
  altexonstest.id <- altexonstest %>%
    as.data.frame() %>%
    dplyr::distinct(seqnames, start, end, strand, gene_id, gene_name, AStype) %>%
    dplyr::mutate(AS_id = sprintf("AS%05d", dplyr::row_number()))

  altexonstest <- altexonstest %>%
    as.data.frame() %>%
    dplyr::left_join(altexonstest.id, by = c("seqnames", "start", "end",
                                         "strand", "gene_id", "gene_name",
                                         "AStype")) %>%
    GenomicRanges::makeGRangesFromDataFrame(keep.extra.columns = T)

  altexonstest <- BiocGenerics::sort(altexonstest)
}

as.data.frame(altexonstest)

which yields the final output of:

AS_output

Of note is that the internal exons of both transcripts A and B are called "cassette exons" by factR2's strategy, since they are completely contained within a region that is intronic in at least one annotated isoform.

My expectation going into this exercise was that the splice annotation would note that the exon in transcript B is an alternative donor isoform of transcript A's internal exon. Would it be possible for future versions of factR2 to annotate the exon in transcript A as a cassette exon, but the exon in transcript B as an alternative donor event? Is there precedence in other tools that annotate splice events (e.g., alternative splice analysis tools like rMATS) to handle "cassette exons" that are included in only one isoform differently from those included in multiple isoforms?

I am interested in this functionality as I would like to identify instances where alternative splicing of the "cassette exon" influences properties of the two "cassette exon" containing isoforms (i.e., their half-lives).

Best, Isaac

fursham-h commented 9 months ago

Hi @isaacvock,

Thank you very much for your comment and your interest in factR2.

Yes, we are aware of this limitation of the .runAS feature and the current implementation is designed to pick up most common splicing types. We are constantly thinking of ways to improve this without compromising runtime, and we will try to detect more complex examples as you've described.

fursham-h commented 6 months ago

Hi @isaacvock , thank you for your patience with this. We have improved the splicing detection strategy of factR2 and it should now distinguish complex splicing events. Do give this a try and let us know if there are any problems.