Roleren / ORFik

MIT License
32 stars 9 forks source link

Feature request - map peptides back to ORFs and genomic locations (splice-aware) #175

Closed josiegleeson closed 2 months ago

josiegleeson commented 2 months ago

Hi ORFik team,

Thanks for creating and maintaining such a useful package! I've been using it lots. I have a query, not sure if ORFik already does this in some way so perhaps you could help me.

I'm trying to create a proteogenomics pipeline and have been using some functions from ORFik. But once I am integrating my proteomics data back into the ORFik data I want to be able to map my peptides to the genome.

I have long-read data that I've discovered novel isoforms (and ORFs) within. Now I have peptides mapped to these isoforms/ORFs. I have a metadata file already with my transcript-level coordinates of the ORF and transcript, as well as genome-level coordinates of the ORF and transcript. Therefore, I can work out the transcript-level coordinates of a mapped peptide by working out where it is within the ORF. The package 'mapFromTranscripts' seems to be almost exactly what I want, but it only returns the first and last genomic coordinates of the peptide, therefore is not splice-aware of exons.

I've written a custom function to do this but its very slow and convoluted and I feel like ORFik has similar functionality so I might be able to use a function within ORFik instead.

Thanks for any help or assistance with this, I really appreciate it!

Here's my code in R:

# use GenomicFeatures to make txdb of gtf
  gtf <- makeTxDbFromGFF(gtf_file)
  exonsdb <- exonsBy(gtf, "tx", use.names=TRUE)

# map transcriptomic peptide coordinates to genomic coordinates
map_peptides_to_genome <- function(peptide_tx_coords, exonsdb) {

  # map peptide transcript coords to genomic coords
  # NOT splice aware
  peptide_genome_coords <- mapFromTranscripts(peptide_tx_coords, exonsdb, ignore.strand = FALSE)

  # define transcript name
  tx_id <- as.character(seqnames(peptide_tx_coords)[1])
  exon_data <- exonsdb[[tx_id]]

  # get exon start and ends and cumulative lengths per transcript
  ex_start <- start(exon_data)
  ex_end <- end(exon_data)
  cumulative_lengths <- cumsum(width(exon_data))

  # get splice aware coordinates of every peptide
  mapped_ranges <- lapply(seq_along(peptide_genome_coords), function(i) {

    # get start and end pos
    coord_start <- start(peptide_genome_coords[i])
    coord_end <- end(peptide_genome_coords[i])

    # find intersecting exons with peptide start/end pos
    intersecting_exons <- which(ex_start <= coord_end & ex_end >= coord_start)

    # get genomic positions
    if (length(intersecting_exons) > 0) {

      positions <- unlist(lapply(intersecting_exons, function(exon_idx) {
        relative_start <- max(coord_start, ex_start[exon_idx]) - ex_start[exon_idx] + 1
        relative_end <- min(coord_end, ex_end[exon_idx]) - ex_start[exon_idx] + 1
        genomic_start <- ex_start[exon_idx] + relative_start - 1
        genomic_end <- ex_start[exon_idx] + relative_end - 1
        IRanges(genomic_start, genomic_end)}))

      GenomicRanges::reduce(do.call(c, positions))

    } else {

      IRanges(integer(0), integer(0))  # empty range if no intersection

    }
  })

  if (length(mapped_ranges) == 0) {
    GRanges()
  } else {
    # output GRanges object
    GRanges(seqnames = rep(paste0(seqnames(exon_data)[1]), length(mapped_ranges[[1]])),
            ranges = mapped_ranges[[1]],
            strand = rep(strand(exon_data)[1], length(mapped_ranges[[1]])),
            txname = rep(paste0(seqnames(peptide_tx_coords)[1]), length(mapped_ranges[[1]])),
            gene_id = rep(paste0(mcols(peptide_tx_coords)$gene), length(mapped_ranges[[1]])),
            PID = rep(paste0(mcols(peptide_tx_coords)$PID), length(mapped_ranges[[1]])),
            peptide = rep(paste0(mcols(peptide_tx_coords)$peptide), length(mapped_ranges[[1]]))
    )}

}

# apply map_peptides_to_genome to each element using index
apply_map_peptides_to_genome_function <- function(index) {
  coords <- peptide_transcript_coords[index]  # subset GRanges by index
  map_peptides_to_genome(coords, exons_filt)
}

results_pept <- GRangesList(lapply(seq_along(peptide_transcript_coords), apply_map_peptides_to_genome_function))
Roleren commented 2 months ago

Never do this in pure R, this is what the c wrappers are for :D Since you do not know how many introns you need to jump, these processes are usually pure c++ for loops.

Yes so I would use the function ORFik::pmapFromTranscriptF(), this is my fastest c function for that kind of stuff.

It is called "pmap" and not "map" since you must first defined pairs before you insert, that is what makes it fast.

So define like this:

# Use proper txdb from ORFik (do this once, then load txdb from path, much faster and has full seqinfo)
txdb <- ORFik::makeTxdbFromGenome(gtf, genome, organism, optimize = TRUE, return = TRUE)
# IRanges of peptides/ORFs.
pep <- peptide_transcript_coords # what you called it
pep_names <- as.character(seqnames(pep))

# Use ORFik::loadRegion or whatever to get the grl object
genes_in_genomic_coords <- ORFik::loadRegion(txdb, "tx") # All coding and noncoding transcripts as GRangesList(...) 
# Now match them
names(pep) <- match(pep_names, names(genes_in_genomic_coords)) # matching

# Now the tests
stopifnot(all(pep_names %in% names(genes_in_genomic_coords))) # Correct order and all are included, pep must be named after transcript it belonged to. 
stopifnot(!anyNA(names(pep))) # All matched, no NAs
# Run pmap
pep_in_genomic <- ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords, removeEmpty = TRUE)

Just made a prototype in my head without testing, so let me know if it worked :)

josiegleeson commented 2 months ago

Hi Håkon, thanks so much for the quick reply! I had a feeling you'd already have something figured out for this.

Looks like it works! Returns spliced coordinates which is awesome and currently appears to be insanely faster than what I had written lol.

I am getting this error at some point, but I will troubleshoot it. I'm guessing one of my ORFs/peptides coordinates is incorrect and doesn't map anywhere. Any idea how I can just skip these entries?

pep_in_genomic <- ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords, removeEmpty = T)
Error in ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords,  : 
  Invalid ranges to map, check them. One is bigger than its reference

Thanks again!!

josiegleeson commented 2 months ago

Another quick question, is there any way to preserve the metadata columns within the pep object?

I have the peptide names there...

I've tried wrapping it within a function and assigning the mcols back, but once again it significantly slows it down.

Thanks!

Roleren commented 2 months ago

I usually do something like this, this is instant time usually:

pep_ids <- pep$peptide_id # The original ids in the GRanges object
pep_in_genomic@unlistData$pep_ids <- pep_ids[groupings(pep_in_genomic)] # Map back to GRangesList, with group information
Roleren commented 2 months ago

To check that something is within, do this: GRangesList(GRanges("chr1", 1, "+"), GRanges("chr1", IRanges(300, 311), "+"), GRanges("chr1", 306, "+")) %within% GRangesList(GRanges("chr1", IRanges(300, 310), "+")) # Only third is within

This should not happen in theory, so that means you have some peptides outside the gene boundary, i.e. you used another annotation for finding the peptides or something.

josiegleeson commented 2 months ago

Awesome, its working really well now and has sped up my workflow a tonne. Thanks again