Closed josiegleeson closed 6 months ago
Never do this in pure R, this is what the c wrappers are for :D Since you do not know how many introns you need to jump, these processes are usually pure c++ for loops.
Yes so I would use the function ORFik::pmapFromTranscriptF(), this is my fastest c function for that kind of stuff.
It is called "pmap" and not "map" since you must first defined pairs before you insert, that is what makes it fast.
So define like this:
# Use proper txdb from ORFik (do this once, then load txdb from path, much faster and has full seqinfo)
txdb <- ORFik::makeTxdbFromGenome(gtf, genome, organism, optimize = TRUE, return = TRUE)
# IRanges of peptides/ORFs.
pep <- peptide_transcript_coords # what you called it
pep_names <- as.character(seqnames(pep))
# Use ORFik::loadRegion or whatever to get the grl object
genes_in_genomic_coords <- ORFik::loadRegion(txdb, "tx") # All coding and noncoding transcripts as GRangesList(...)
# Now match them
names(pep) <- match(pep_names, names(genes_in_genomic_coords)) # matching
# Now the tests
stopifnot(all(pep_names %in% names(genes_in_genomic_coords))) # Correct order and all are included, pep must be named after transcript it belonged to.
stopifnot(!anyNA(names(pep))) # All matched, no NAs
# Run pmap
pep_in_genomic <- ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords, removeEmpty = TRUE)
Just made a prototype in my head without testing, so let me know if it worked :)
Hi Håkon, thanks so much for the quick reply! I had a feeling you'd already have something figured out for this.
Looks like it works! Returns spliced coordinates which is awesome and currently appears to be insanely faster than what I had written lol.
I am getting this error at some point, but I will troubleshoot it. I'm guessing one of my ORFs/peptides coordinates is incorrect and doesn't map anywhere. Any idea how I can just skip these entries?
pep_in_genomic <- ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords, removeEmpty = T)
Error in ORFik::pmapFromTranscriptF(pep, genes_in_genomic_coords, :
Invalid ranges to map, check them. One is bigger than its reference
Thanks again!!
Another quick question, is there any way to preserve the metadata columns within the pep object?
I have the peptide names there...
I've tried wrapping it within a function and assigning the mcols back, but once again it significantly slows it down.
Thanks!
I usually do something like this, this is instant time usually:
pep_ids <- pep$peptide_id # The original ids in the GRanges object
pep_in_genomic@unlistData$pep_ids <- pep_ids[groupings(pep_in_genomic)] # Map back to GRangesList, with group information
To check that something is within, do this:
GRangesList(GRanges("chr1", 1, "+"), GRanges("chr1", IRanges(300, 311), "+"), GRanges("chr1", 306, "+")) %within% GRangesList(GRanges("chr1", IRanges(300, 310), "+")) # Only third is within
This should not happen in theory, so that means you have some peptides outside the gene boundary, i.e. you used another annotation for finding the peptides or something.
Awesome, its working really well now and has sped up my workflow a tonne. Thanks again
Hi ORFik team,
Thanks for creating and maintaining such a useful package! I've been using it lots. I have a query, not sure if ORFik already does this in some way so perhaps you could help me.
I'm trying to create a proteogenomics pipeline and have been using some functions from ORFik. But once I am integrating my proteomics data back into the ORFik data I want to be able to map my peptides to the genome.
I have long-read data that I've discovered novel isoforms (and ORFs) within. Now I have peptides mapped to these isoforms/ORFs. I have a metadata file already with my transcript-level coordinates of the ORF and transcript, as well as genome-level coordinates of the ORF and transcript. Therefore, I can work out the transcript-level coordinates of a mapped peptide by working out where it is within the ORF. The package 'mapFromTranscripts' seems to be almost exactly what I want, but it only returns the first and last genomic coordinates of the peptide, therefore is not splice-aware of exons.
I've written a custom function to do this but its very slow and convoluted and I feel like ORFik has similar functionality so I might be able to use a function within ORFik instead.
Thanks for any help or assistance with this, I really appreciate it!
Here's my code in R: