Closed josefinaperalba closed 11 months ago
Hey, I'll send what you need tomorrow morning.
Have you also run a QC analysis on the results that you can share?
ORFik is highly optimized for speed, so there is no point in subsetting BAM files. That is why we made a format called "ofst", based on the fst format from Facebook. To convert BAM files to ofst, see the tutorial linked below (an ofst file loads in full in < 1 second, much faster than BAM).
Note that you should only run features like RRS on p-site shifted reads, not on the BAM files!
So, first create an ORFik.experiment.
Then run:
Here is full tutorial script for ribo-seq:
https://bioconductor.org/packages/release/bioc/vignettes/ORFik/inst/doc/Ribo-seq_pipeline.html
If you have already aligned to bam files, start from this line:
# Create experiment (Starting point if alignment is finished)
For the section where I get RRS scores, I use the ORFik feature wrapper function called
computeFeatures, which is found in the tutorial in the section:
# Feature table (From WT rep 1)
If you have any questions, just ask.
Hi, my objective is to calculate the RRS score for different ORFs that were already detected by another pipeline.
Until now I've been calculating the RRS with the script that I leave below. I realize now that I am missing the p-shifting step for the BAM files. Is that mandatory? Is my script okay, and can I solve my memory requirement issue by changing from BAM to ofst? Thanks!
Script:
#!/usr/bin/env Rscript
# Check if required arguments are provided
if (is.null(opt$samples) || is.null(opt$orfs) || is.null(opt$output) || is.null(opt$gtf)) {
stop("Required arguments are missing", call. = FALSE)
}
# Load required libraries with suppressed messages
suppressMessages(library(ORFik))
suppressMessages(library(GenomicAlignments))
suppressMessages(library(tidyr))
suppressMessages(library(parallel))
# Read the sample sheet
sample_sheet <- read.csv(opt$samples, header = TRUE)
# Load ORFs summary
orf_summary <- read.csv(opt$orfs, sep = "\t")
orf_summary$samples <- strsplit(paste(orf_summary$ribocode_hit_samples, orf_summary$ribotricer_hit_samples, sep = ","), ",")
orf_summary$start <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][1])
orf_summary$end <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][2])
# Load reference GTF
reference_gtf <- read.csv(opt$gtf, sep = "\t", comment.char = "#", header = FALSE, col.names = c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute"))
# Function to compute RRS for a single sample
compute_RRS <- function(sample_name, bam_file, gtf_file, orf_summary) {
local_reference_gtf <- reference_gtf
# Filter for the current sample
sample_orf_summary <- orf_summary[sapply(orf_summary$samples, FUN = function(x) is.element(sample_name, x)), ]
# Load Ribo-Seq reads
riboseq_reads <- readGAlignments(bam_file, index = paste0(bam_file, ".bai"), use.names = TRUE)
# Extract 3' UTRs from reference GTF
three_utrs <- local_reference_gtf[local_reference_gtf$feature == "three_prime_utr", ]
three_utrs$transcript_id <- lapply(three_utrs$attribute, function(x) sub(";.*", "", sub(".*transcript_id ", "", x)))
# Filter out chromosomes that are not present in all datasets
chromosomes <- Reduce(intersect, list(sample_orf_summary$chrom, unique(seqnames(riboseq_reads)), three_utrs$seqname))
sample_orf_summary <- sample_orf_summary[sample_orf_summary$chrom %in% chromosomes, ]
riboseq_reads <- riboseq_reads[seqnames(riboseq_reads) %in% chromosomes]
three_utrs <- three_utrs[three_utrs$seqname %in% chromosomes, ]
# Match ORFs and 3' UTRs
sample_orf_summary$granges <- lapply(1:nrow(sample_orf_summary), function(i) {
GRanges(seqnames = sample_orf_summary[i, "chrom"],
ranges = IRanges(start = as.numeric(sample_orf_summary[i, "start"]),
end = as.numeric(sample_orf_summary[i, "end"])),
strand = sample_orf_summary[i, "strand"])
})
three_utrs$granges <- lapply(1:nrow(three_utrs), function(i) {
GRanges(seqnames = three_utrs[i, "seqname"],
ranges = IRanges(start = as.numeric(three_utrs[i, "start"]),
end = as.numeric(three_utrs[i, "end"])),
strand = three_utrs[i, "strand"])
})
orfs <- separate_rows(sample_orf_summary, "transcript_id", "transcript_type", sep = ",")
merged <- merge(x = orfs, y = three_utrs, by.x = "transcript_id", by.y = "transcript_id")
# Create ORF GRangesList
grl <- GRangesList(merged$granges.x)
names(grl) <- merged$transcript_id
# Create 3' UTR GRangesList
three_utrs_granges <- GRangesList(merged$granges.y)
names(three_utrs_granges) <- merged$transcript_id
# Compute RRS
rrs <- ribosomeReleaseScore(grl = grl, RFP = riboseq_reads, GtfOrThreeUtrs = three_utrs_granges)
# Return a data frame with ORF ID and RRS score
result <- data.frame(orf_id = merged$ORF_ID, rrs = rrs, sample_name = sample_name)
result_maxrrs<- setNames(aggregate(result$rrs, by = list(result$orf_id), FUN = max), c("orf_id", "max_rrs"))
result <- merge(result, result_maxrrs, by = "orf_id", all.x = TRUE)
result <- result[result$rrs == result$max_rrs, ]
result$row_num <- ave(result$rrs, result$orf_id, FUN = seq_along)
result <- result[result$row_num == 1, ]
result$max_rrs <- NULL
result$row_num <- NULL
return(result)
}
# Initialize a cluster for parallel processing
cl <- makeCluster(opt$cores)
clusterExport(cl, c("GRanges", "IRanges"))
sample_sheet_list <- split(sample_sheet, seq(nrow(sample_sheet)))
# Create a function to compute RRS in parallel
compute_RRS_single_row <- function(row) {
compute_RRS(row$sample, row$bam, reference_gtf, orf_summary)
}
# Use mclapply to compute RRS scores in parallel
result_list <- mclapply(sample_sheet_list, compute_RRS_single_row, mc.cores = opt$cores)
# Close the cluster
stopCluster(cl)
# Combine RRS results into a single data frame
combined_results <- do.call(rbind, result_list)
# Write the combined RRS results to the output file
write.table(combined_results, opt$output, row.names = FALSE, sep = "\t")
cat("All done!\n")
Yes, I would say the p-shifting is mandatory (your results will be a bit inaccurate without it). Since BAM files contain ~30 nt fragments, the RRS score counts a hit on an ORF whenever any part of the fragment overlaps it, whereas with shifting it only counts a hit if the p-site is on the ORF. Also, for memory etc., ofst is much more efficient.
May I ask which ORF prediction algorithm you used, if you do not have p-shifted fragments?
Why do you load your gtf like that? If you have a proper gtf, see this tutorial for how to load 3' UTRs etc. in ORFik: https://bioconductor.org/packages/release/bioc/vignettes/ORFik/inst/doc/Importing_Data.html
Some notes:
I advise you to read the ORFik tutorials, make ofst files, and send me any questions you have here, as I believe your current method will contain errors and strange results :)
If this is a one time analysis which is not to be published, then I would think your current script is "nearly ok".
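To illustrate the point about p-shifting above, here is a minimal base R sketch (no ORFik calls) of a single footprint that overlaps an ORF while its p-site does not. All coordinates and the 12 nt offset are assumptions chosen for illustration; real offsets depend on read length and are estimated from the data.

```r
# Hypothetical plus-strand footprint and ORF (illustrative coordinates only)
read_start <- 85
read_end <- 114          # 30 nt fragment
orf_start <- 100
orf_end <- 200
psite_offset <- 12       # assumed offset from the 5' end of the read

# Unshifted: any overlap between fragment and ORF counts as a hit
read_overlaps_orf <- read_start <= orf_end && read_end >= orf_start

# Shifted: only the single p-site position is tested
psite <- read_start + psite_offset
psite_in_orf <- psite >= orf_start && psite <= orf_end

print(c(read_overlaps_orf = read_overlaps_orf, psite_in_orf = psite_in_orf))
```

Here the unshifted fragment would be counted for the ORF, while its p-site (position 97) lies upstream of it, so the shifted read would not be counted.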
I want to calculate the scores at the ORF level. I have a tsv file with all the information necessary to create a GRanges of ORFs. How can I include that in the experiment if I cannot pass it directly into the computeFeatures function?
So first the error you sent:
Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand): Seqlengths of x contains NA values!
To fix this, please convert the gtf to a properly formatted txdb (it complains about your gtf definition), with this:
gtf_file <- "/opt/projects/1357_BIP/2022-03_Riboseq_pipeline/data/references/annotations/103_from_BI/Homo_sapiens.GRCh38.103.gtf"
makeTxdbFromGenome(gtf_file, genome_file, "Homo sapiens", optimize = TRUE)
# Then update new path for experiment
gtf_file <- "/opt/projects/1357_BIP/2022-03_Riboseq_pipeline/data/references/annotations/103_from_BI/Homo_sapiens.GRCh38.103.gtf.db"
# Now create experiment again with valid txdb:
create.experiment(
bam_dir,
exper = experiment_name,
fa = genome_file,
txdb = gtf_file,
organism = organism,
rep = replicates,
condition = conditions
)
# Finally re-run
ORFikQC(df, create.ofst = FALSE) # create.ofst = FALSE to avoid loading the bam files again; you already succeeded in making ofst
Second question:
For computeFeatures you need a GRangesList, so the first thing you need to do is read in the tsv and convert it to a GRangesList.
So something like:
df <- read.experiment("GSE")
orfs_path <- "path/to/tsv"
orfs <- rtracklayer::import(orfs_path) # <- this usually loads in from tsv, but if not let me know, that means you need a custom converter, which I can help you with.
computeFeatures(orfs, RFP = rfp_lib, RNA = NULL, Gtf = df@txdb, faFile = df@fafile, weight.RFP = score(rfp_lib))
Perfect! Thank you so much for your help. One question: does this function automatically handle the matching between my ORFs and the 3' UTR regions in the reference file for the RRS calculation? What happens if an ORF doesn't match a 3' UTR region?
Ah, yes, forgot to mention that.
So the orfs must be named txname_id
So like first orf of transcript named ENST0000015123, can be named: "ENST0000015123_1" etc. Then it will automatically match.
You can also call them all the tx id, without _1, _2 etc :)
Also if it does not work, you can try to load the tsv, create a GRanges Object from scratch and split on ORF id:
Pseudo code:
orfs <- data.table::fread("tsv_file")
# GRanges takes coordinates through the ranges argument (an IRanges object)
orfs <- GRanges(seqnames = orfs$chromosome, ranges = IRanges(start = orfs$start, end = orfs$end), strand = orfs$strand, tx_id = orfs$tx_id, orf_id = orfs$orf_id)
# If you do not have proper orf_id make it like this:
orfs$orf_id <- paste(orfs$tx_id, as.numeric(as.factor(orfs$orf_id)), sep = "_") # Each orf gets a number, such that all exons of same orf have the same number
orfs <- split(orfs, orfs$orf_id) # orf_id must be made from (tx_id)_(orf_number), then it will work.
Hi, thanks so much for your help. With your help I have been able to calculate the RRS score using the ribosomeReleaseScore function, but I'm not able to run the computeFeatures function, since I get this error:
computeFeatures(orfs, RFP = fimport(filepath(df.rfp[6,], "pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile, weight.RFP = "score")
No RNA added, skipping feature te and fpkm of RNA, also ribosomeReleaseScore will also be not normalized best way possible.
Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand) : Seqlengths of x contains NA values!
Thanks!!
On another note, are there any plans to incorporate the RAI score from this paper (10.1186/s12864-018-4765-z) into the library?
Thank you!
Yes, I see I need to make that error clearer.
The cause is that you did not add the seqinfo to the "orfs" object.
Simply do this:
seqinfo(orfs) <- seqinfo(df.rfp) #df.rfp is the ORFik.experiment
Now rerun computeFeatures and it should work.
BTW, all subfunctions required for RAI are already in ORFik.
It is quite trivial to remake, but I can add it on the wish list as a function implemented.
If you wanted to make a mock for yourself, just define RAI: x_i (count of ORF > 10 for sample i), y_i (translation status: use FLOSS, RRS, and ORFscore, set cutoff from cds 90% included for sample i) / x_i
No magic.
hi, thanks again for all your responses!
I've tried adding the seqinfo.
This is what I did
organism <- "Homo sapiens"
paired_end <- FALSE # Set to TRUE if your data is paired-end
replicates <- c(1, 2, 3, 1, 2, 3)
conditions <- rep(c("CTR", "PM25"), each = 3)
conf <- config.exper(experiment = "GSE",
assembly = "Homo_sapiens_GRCh38_103",
type = "Ribo-seq")
# Create the experiment
create.experiment(
bam_dir,
exper = experiment_name,
fa = genome_file,
txdb = gtf_file,
organism = organism,
rep = replicates,
condition = conditions
)
df.rfp <- read.experiment("GSE")
ORFikQC(df.rfp, complex.correlation.plots = FALSE)
shiftFootprintsByExperiment(df.rfp)
orf_summary <- read.csv('/home/zs-ans/rrs/orf_summary.tsv', sep = "\t", nrows=5)
orf_summary$samples <- strsplit(paste(orf_summary$ribocode_hit_samples, orf_summary$ribotricer_hit_samples, sep = ","), ",")
orf_summary$start <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][1])
orf_summary$end <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][2])
orf_summary <- separate_rows(orf_summary, "transcript_id", "transcript_type", sep = ",")
sample_name<-'PM25_3'
orf_summary <- orf_summary[sapply(orf_summary$samples, FUN = function(x) is.element(sample_name, x)), ]
orf_id <- paste(orf_summary$transcript_id, as.numeric(as.factor(orf_summary$ORF_ID)), sep = "_") # Each orf gets a number, such that all exons of same orf have the same number
orf_summary$ORF_ID<-orf_id
orfs<-makeGRangesListFromDataFrame(orf_summary, seqnames.field= 'chrom', split.field = 'ORF_ID')
chromosomes <- intersect(seqinfo(orfs),seqinfo(df.rfp))
seqinfo(orfs) <- chromosomes
dt <- computeFeatures(orfs,
RFP = fimport(filepath(df.rfp[6,], "pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile,
weight.RFP = "score")
This is the error I'm facing now with the computeFeatures function; it didn't happen in the RRS calculation:
Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference
Thanks!
Been sick a few days, so could not respond before now.
Hm, you have an annotation to seqinfo mismatch (i.e. a sequence that goes outside the chromosome boundary or transcript boundary).
So I have an idea of what it could be, did you create the ORF list using a different genome/ annotation file?
Hi! Thanks so much for your response. I've been analyzing the package and it's awesome! Also, I hope you are feeling better!
1) I've rerun my ORF detection to make sure I was using the same files, and I am.
2) Yes, I am using the Homo_sapiens.GRCh38.dna.primary_assembly.fa file as the human genome file.
3) First I get the seqinfo of orfs and df.rfp for the last sample. Then I run the intersect. Once I have calculated the intersect, I replace the seqinfo of orfs.
seqinfo(orfs):
seqinfo(orfs)
Seqinfo object with 26 sequences from an unspecified genome; no seqlengths:
seqlengths isCircular genome
1 NA NA
4) seqlevels(orfs, pruning.mode = "coarse") <- seqlevels(intersect)
seqinfo(orfs) <- intersect
Again thank you so much for your help! This is tremendous help for me!
Ok, then I think it is the gtf or the orfs.
The problem is that you have a range trying to map outside the chromosome boundary, so something has the wrong ranges. That is what we need to figure out.
Hi, thanks for your response!
translate(ORFik:::txSeqsFromFa(orfs, df.rfp@fafile, TRUE, TRUE))
AAStringSet object of length 13:
width seq names
[1] 2718 MTPYEGKDSVLRRRTPGGFYILS...LLGLQPLLTRSSTC*GASRR ENST00000347370_4
[2] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000349431_5
[3] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000360466_5
[4] 2718 MTPYEGKDSVLRRRTPGGFYILS...L*LGLQPLLTRSSTCGASRR ENST00000400929_4
[5] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000400930_6
... ... ...
[9] 2718 MTPYEGKDSVLRRRTPGGFYILS...LLGLQPLLTRSSTCGASRR ENST00000450390_4
[10] 4249 MPEIRVTPLGEWEPPGGLWGGLR...DEELGSFLTSLLKKGLPQAPS ENST00000618806_1
[11] 143 MLAGNEFQVSLSSSMSVSELKAQ...PLEDQLPLGEYGLKPLSTVFMN ENST00000624652_9
[12] 157 MLAGNEFQVSLSSSMSVSELKAQ...PLSTVFMNLRLRGGGTEPGGRS ENST00000624697_10
[13] 300 MVRQMSQVGGGGLCASQFSSPSP...SPAPCSICACGEAAQSLAGG ENST00000649529_8
Warning messages:
1: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[1]]': last 2 bases were ignored
2: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[2]]': last base was ignored
3: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[3]]': last base was ignored
4: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[4]]': last 2 bases were ignored
5: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[5]]': last base was ignored
6: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[8]]': last base was ignored
7: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[9]]': last 2 bases were ignored
8: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[11]]': last 2 bases were ignored
9: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[13]]': last 2 bases were ignored
Again thank you so much for all your help!
Can you try to run without the ORF that hits the flank? Also, the translate call gives warnings: your ORFs are not made of triplets. You can also see that none of them end with a stop codon. Were they defined from the start up to only the first nucleotide of the second-last codon? Looks strange to me.
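The triplet problem mentioned above can be checked directly from ORF widths. This is a minimal base R sketch, using two ORF widths that appear later in this thread (8156 and 471); with a real GRangesList the total width per ORF would be the sum of its exon widths, which is an assumption about how you would wire it in.

```r
# Total ORF widths in nucleotides (values taken from this thread)
orf_widths <- c(ENST00000450390_1 = 8156, ENST00000624697_1 = 471)

# A complete ORF should be a whole number of codons
leftover_bases <- orf_widths %% 3
not_triplets <- names(orf_widths)[leftover_bases != 0]

print(leftover_bases)
print(not_triplets)
```

An ORF with leftover bases is what makes translate() warn that the last 1 or 2 bases were ignored.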
Hi, after investigating the data I found two cases that i think are interesting.
This is an example of an ORF where I can run the computeFeatures function:
orf id: ENST00000624697_1 transcript_id: ENST00000624697 start: 1014005 end: 1014475
Reference: transcript_id: ENST00000624697
feature          start    end
transcript       1001138  1014540
exon             1001138  1001281
exon             1008194  1008279
exon             1013984  1014540
CDS              1014005  1014475
start_codon      1014005  1014007
stop_codon       1014476  1014478
five_prime_utr   1001138  1001281
five_prime_utr   1008194  1008279
five_prime_utr   1013984  1014004
three_prime_utr  1014479  1014540
I also checked the widths, since that is the part where I'm getting the error, and the values are:
[["ENST00000624697_1"]] 471
[["ENST00000624697"]] 144 86 557
Now here is an example of an ORF that gives me this error:
Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference
orf id: ENST00000450390_1 transcript_id: ENST00000450390 start: 1255206 end: 1263361
Reference transcript_id: ENST00000450390
feature          start    end
transcript       1253909  1273853
exon             1273666  1273853
exon             1267862  1267992
CDS              1267864  1267992
start_codon      1267990  1267992
exon             1266098  1266290
stop_codon       1267862  1267863
stop_codon       1266290  1266290
exon             1263346  1263386
exon             1257208  1257310
exon             1256992  1257130
exon             1256045  1256125
exon             1253909  1255487
five_prime_utr   1273666  1273853
three_prime_utr  1266098  1266289
three_prime_utr  1263346  1263386
three_prime_utr  1257208  1257310
three_prime_utr  1256992  1257130
three_prime_utr  1256045  1256125
three_prime_utr  1253909  1255487
width:
[["ENST00000450390_1"]] 8156
[["ENST00000450390"]] 188 131 193 41 103 139 81 1579
Thank you so much!!
Yeah, there you have it. The last ORF (ENST00000450390_1) is over 8k long, while the full transcript is only about 2k long, which is not possible. Where did you get the names from? Maybe it belongs to a different isoform, for which it would be correct?
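The comparison above can be reproduced with plain arithmetic. This base R sketch uses only the coordinates and exon widths posted in this thread for ENST00000450390; nothing here calls ORFik.

```r
# ORF defined as one genomic span (start and end as posted above)
orf_start <- 1255206
orf_end <- 1263361
orf_width <- orf_end - orf_start + 1

# Exon widths of the parent transcript, as posted above
exon_widths <- c(188, 131, 193, 41, 103, 139, 81, 1579)
tx_width <- sum(exon_widths)

# An ORF can never be longer than the spliced transcript it sits on
orf_too_long <- orf_width > tx_width
print(c(orf_width = orf_width, tx_width = tx_width))
print(orf_too_long)
```

A single span from ORF start to ORF end includes introns, which is one way an ORF can end up wider than its spliced transcript.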
Hi! Thanks for your response! After carefully analyzing the data, I've found that the coordinates I was using contained introns and were not just the exonic regions of the ORF.
Now I have changed that and my orfs GRanges looks like this:
orfs
GRangesList object of length 67654:
$ENST00000000233_1
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 7 127589083-127589163 +
-------
seqinfo: 26 sequences from an unspecified genome; no seqlengths
$ENST00000000233_2
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 7 127589485-127589594 +
-------
seqinfo: 26 sequences from an unspecified genome; no seqlengths
$ENST00000000233_3
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 7 127590066-127590137 +
-------
seqinfo: 26 sequences from an unspecified genome; no seqlengths
...
<67651 more elements>
> seqinfo(orfs)
Seqinfo object with 26 sequences from an unspecified genome; no seqlengths:
seqnames seqlengths isCircular genome
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
... ... ... ...
22 <NA> <NA> <NA>
X <NA> <NA> <NA>
Y <NA> <NA> <NA>
MT <NA> <NA> <NA>
KI270728.1 <NA> <NA> <NA>
This is the seqinfo information of my alignment files:
seqinfo(df.rfp[6,])
Seqinfo object with 194 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
1 248956422 <NA> <NA>
10 133797422 <NA> <NA>
11 135086622 <NA> <NA>
12 133275309 <NA> <NA>
13 114364328 <NA> <NA>
... ... ... ...
KI270539.1 993 <NA> <NA>
KI270385.1 990 <NA> <NA>
KI270423.1 981 <NA> <NA>
KI270392.1 971 <NA> <NA>
KI270394.1 970 <NA> <NA>
I have tried several ways of intersecting them but i keep getting this error
dt <- computeFeatures(orfs,
+ RFP = fimport(filepath(df.rfp[6,],"pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile,
+ weight.RFP = "score")
No RNA added, skipping feature te and fpkm of RNA, also ribosomeReleaseScore will also be not normalized best way possible.
Error: subscript contains invalid names
What would be the problem now and what is the best way of intersecting?
Just to make sure first, could you run this:
any(lengths(orfs) > 1) # Test to see there is an orf with 2 exons, if not it is most likely still wrong.
Then for your error: so "subscript contains invalid names" means that it tries a [] subsetting, but could not find that name.
To replicate try this:
library(GenomicRanges); a <- GRangesList(a = GRanges("1", 1)); a["b"] # "b" is not a name in GRL
My guess is that you have an orf that has a transcript name which is not in your annotation etc.
To speed up finding the bug, I can show you how to debug properly.
First run this:
# <- Define all variables needed
debug(computeFeatures)
dt <- computeFeatures(orfs, RFP = fimport(filepath(df.rfp[6,],"pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile, weight.RFP = "score")
# Now in debug mode in console press n + enter for next
# and s + enter to step into a function, the first round figure out which sub function inside computeFeatures fails, then you can run:
undebug(computeFeatures) # Remove debug flag for computeFeatures
debug(function_that_fails) # And now run computeFeatures again
I am quite sure you will find which line fails and why from that :)
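Before stepping through the debugger, the name-mismatch guess above can be tested directly. This is a minimal base R sketch assuming ORFs are named (tx_id)_(number); the IDs below are placeholders, and "ENST00009999999" is a made-up ID standing in for a transcript that is missing from the reference set.

```r
# Hypothetical ORF names and annotated transcript IDs
orf_names <- c("ENST00000000233_1", "ENST00000000233_2", "ENST00009999999_1")
annotated_tx <- c("ENST00000000233", "ENST00000000412")

# Strip the trailing _<number> to recover the parent transcript ID
orf_tx <- sub("_[0-9]+$", "", orf_names)

# Transcripts referenced by ORFs but absent from the annotation
missing_tx <- unique(orf_tx[!orf_tx %in% annotated_tx])
print(missing_tx)
```

With real objects, names(orfs) would take the place of orf_names, and annotated_tx would be the names of whichever region is being subset (transcripts, 5' UTRs or 3' UTRs); that wiring is an assumption, not something confirmed in this thread.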
Hi, I have found the problem, the debugging instructions were really helpful! Some of my ORFs do not have a 5' UTR region in the reference. I deleted those, but now I'm back at this error, only when enabling the uorfFeatures option: Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference.
My reference data looks like this for a transcript id
and my orf table looks like this for the same transcript_id
Now, I've also analyzed the widths of those objects to see where the error is coming from. As you can see, I no longer have ORFs whose exonic regions are bigger than their reference, but I think there is a problem with the mapping.
> xWidths
IntegerList of length 65542
[["ENST00000000233_1"]] 81
[["ENST00000000233_2"]] 110
[["ENST00000000233_3"]] 72
[["ENST00000000233_4"]] 126
[["ENST00000000233_5"]] 84
[["ENST00000000412_1"]] 176
[["ENST00000000412_2"]] 167
[["ENST00000000412_3"]] 110
[["ENST00000000412_4"]] 131
[["ENST00000000412_5"]] 127
...
<65532 more elements>
> txWidths
IntegerList of length 65542
[["ENST00000000233"]] 155 81 110 72 126 488
[["ENST00000000233"]] 155 81 110 72 126 488
[["ENST00000000233"]] 155 81 110 72 126 488
[["ENST00000000233"]] 155 81 110 72 126 488
[["ENST00000000233"]] 155 81 110 72 126 488
[["ENST00000000412"]] 158 177 167 110 131 127 1580
[["ENST00000000412"]] 158 177 167 110 131 127 1580
[["ENST00000000412"]] 158 177 167 110 131 127 1580
[["ENST00000000412"]] 158 177 167 110 131 127 1580
[["ENST00000000412"]] 158 177 167 110 131 127 1580
...
<65532 more elements>
> xWidths[xWidths>=txWidths]
IntegerList of length 65542
[["ENST00000000233_1"]] integer(0)
[["ENST00000000233_2"]] integer(0)
[["ENST00000000233_3"]] integer(0)
[["ENST00000000233_4"]] integer(0)
[["ENST00000000233_5"]] integer(0)
[["ENST00000000412_1"]] 176
[["ENST00000000412_2"]] 167
[["ENST00000000412_3"]] integer(0)
[["ENST00000000412_4"]] integer(0)
[["ENST00000000412_5"]] integer(0)
...
<65532 more elements>
Again thank you so much!
I see a big worry! I think you might not have any splicing information for the ORFs.
So here is the problem: An ORF might span multiple exons, so if an ORF spans 2 exons it needs 2 GRanges for the object.
Like this:
> orfs <- GRangesList(ENST0000051251_1 = GRanges("1", IRanges(c(1, 5), width = 3), "+"))
> orfs
GRangesList object of length 1:
$ENST0000051251_1
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1-3 +
[2] 1 5-7 +
You see this ORF spans 2 exons (in this case the intron is 1 base, at position 4).
RiboCode only gives transcript coordinates in the column "ORF_tstart" and genomic coordinates in ORF_gstart.
If you define ORFs directly from ORF_gstart you don't have the splicing information.
So the proper way is to take "ORF_tstart" and map it to genomic coordinates. For a full example, run this:
library(ORFik)
mrna_genomic <- GRangesList(ENST0000051251 = GRanges("1", IRanges(c(1, 20), width = 9), "+")) # The mRNAs, Intron at 10-19, for you load through this: mrna_genomic <- loadRegion(df_rfp, "tx")
orfs_starts <- c(4, 7) # tx_coordinate: in RiboCode: ORF_tstart
orf_widths <- c(6, 6) # 6 nt long ORFs, this is ORF_tstop - ORF_tstart + 1 (6 - 1 + 1 = 6)
orf_tx_parrent <- rep("ENST0000051251", 2) # Both from same tx
orfs_tx <- IRanges(orfs_starts, width = orf_widths) # In tx coordinates (2 ORFs)
tx_orf_matching <- data.table::chmatch(orf_tx_parrent, names(mrna_genomic)) # Match orf to tx index
all_indices_found <- !anyNA(tx_orf_matching) # Check before split(), which silently drops NA groups
stopifnot(all_indices_found) # If this fails, you have an ORF which is not from an mRNA
orfs_tx <- split(orfs_tx, tx_orf_matching)
# Now map to genomic coordinates with splicing and give correct names, txid_1 etc
ORFik:::mapToGRanges(mrna_genomic, orfs_tx, groupByTx = FALSE)
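To show what the transcript-to-genomic mapping does conceptually, here is a simplified base R sketch for a plus-strand transcript only. The helper tx_to_genomic is hypothetical and written just for this illustration; it is not ORFik's implementation, and it ignores minus-strand transcripts.

```r
# Hypothetical helper: map a transcript-coordinate interval to genomic exon
# pieces. Plus strand only; exons must be given in 5' to 3' order.
tx_to_genomic <- function(exon_starts, exon_ends, tstart, tend) {
  exon_widths <- exon_ends - exon_starts + 1
  tx_ends <- cumsum(exon_widths)           # last transcript position per exon
  tx_starts <- tx_ends - exon_widths + 1   # first transcript position per exon
  hit <- tx_starts <= tend & tx_ends >= tstart
  piece_tstart <- pmax(tx_starts[hit], tstart)
  piece_tend <- pmin(tx_ends[hit], tend)
  data.frame(
    start = exon_starts[hit] + (piece_tstart - tx_starts[hit]),
    end = exon_starts[hit] + (piece_tend - tx_starts[hit])
  )
}

# Same toy transcript as above: exons 1-9 and 20-28, intron at 10-19
orf1 <- tx_to_genomic(c(1, 20), c(9, 28), tstart = 4, tend = 9)
orf2 <- tx_to_genomic(c(1, 20), c(9, 28), tstart = 7, tend = 12)
print(orf1)  # one piece: 4-9
print(orf2)  # two pieces: 7-9 and 20-22
```

The second ORF crosses the exon junction, so it needs two genomic ranges; a single range from 7 to 22 would wrongly include the intron.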
Did this give you what you needed?
Also: the RiboCode output has the column "ORF_type", which for uORFs is "uORF" or "Overlap_uORF". Are you sure you subsetted to those when setting uorfFeatures = TRUE? computeFeatures only works for uORFs in that case, not for the other types (since it then requires 5' UTRs, which might not exist).
From your error above (an ORF from a transcript without a 5' UTR), it sounds like you have all ORF types in your list. If so, just subset to the uORF types and run again, then it should work :) So you can split your set in 2: the uORF set, which you can run with uORF features, and all the others, which you can run without uORF features.
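The split described above could look like the following base R sketch. The table is made up for illustration: only the column name ORF_type and the values "uORF" and "Overlap_uORF" come from this thread, while the ORF_ID values and the other two type labels are placeholders.

```r
# Hypothetical ORF table with a RiboCode-style ORF_type column
orf_table <- data.frame(
  ORF_ID = c("orf_a", "orf_b", "orf_c", "orf_d"),
  ORF_type = c("uORF", "annotated", "Overlap_uORF", "dORF"),
  stringsAsFactors = FALSE
)

# uORF types get uORF features; everything else is run without them
is_uorf <- orf_table$ORF_type %in% c("uORF", "Overlap_uORF")
uorf_set <- orf_table[is_uorf, ]
other_set <- orf_table[!is_uorf, ]

print(uorf_set$ORF_ID)
print(other_set$ORF_ID)
```

Each subset would then be converted to its own GRangesList and passed to computeFeatures separately, with uorfFeatures enabled only for the first.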
Perfect! Thank you so much! I got it and I'm finally able to run this function. Now my data looks like this:
orfs
GRangesList object of length 5946:
$ENST00000000233_1
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
Sure, I usually make these pseudo scores like this:
res <- computeFeatures(orfs, …) # Your normal calculation
no_trailer <- is.na(res$RRS) # Detect ORFs without a trailer (RRS is NA)
orfs_with_no_trailers <- orfs[no_trailer]
cds_no_trailers <- cds[txNames(orfs_with_no_trailers)]
cds_stop_no_trailers <- stopCodons(cds_no_trailers, TRUE)
pseudo_trailers <- extendTrailers(cds_stop_no_trailers, extension = 300) # 300 nucleotide pseudo trailers
res_no_trailer <- ribosomeReleaseScore(orfs_with_no_trailers, ribo_seq, pseudo_trailers, weight.RFP = "score")
# Now merge in RRS values from pseudo values
res[no_trailer, ]$RRS <- res_no_trailer
Hi, as always thank you so much for your response!! One question, for those orfs that I want to implement the extendTrailers function. Can I use the stop codons that are in the result from the computeFeatures function?
Hm, both stopCodons and extendTrailers are existing functions in ORFik, no need to reimplement it.
This code will just run, as long as you have orfs and cds as GRangesList objects :)
I mean that for those ORFs I don't have the annotated CDS regions, so I cannot run these commands:
cds_no_trailers <- cds[txNames(orfs_with_no_trailers)]
cds_stop_no_trailers <- stopCodons(cds_no_trailers, TRUE)
That's why I'm trying to find another way to extract the stop codons.
Thanks!
If I understand correctly, you mean ORFs from "non coding transcripts", i.e. the transcript has no defined CDS?
If so, the easiest way is to say anything downstream of orf is 3' utr. To do that just swap CDS with orfs for those. (Note: there is a corner case when 2 orfs are on the same non coding transcript, then the 3' utr of the first will overlap the downstream orf)
I.e. do:
stops <- stopCodons(orfs)
trailers <- extendTrailers(stops, downstream = Inf) # extend infinite = to end of transcript.
Did that work ?
Hi! thanks for your response. I couldn't find the downstream argument in the extendTrailers function. I tried it with the extension argument set to Inf but then I'm getting this error:
novel <- result[result$name %in% dt[is.na(dt$RRS)]$names,]
orfs_novel<-makeGRangesListFromDataFrame(novel, seqnames.field= 'seqname', split.field = 'name')
stops <- stopCodons(orfs_novel)
trailers <- extendTrailers(stops, extension = Inf)
Error in recycleArg(arg, argname, length.out) : 'width' contains NAs
In addition: Warning message:
In recycleIntegerArg(width, "width", length(x)) :
NAs introduced by coercion to integer range
stops
GRangesList object of length 32001:
$ENST00000158526_1
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] X 154450097-154450099 +
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
$ENST00000216019_1
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 22 38492045-38492047 -
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
$ENST00000216019_2
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 22 38492045-38492047 -
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
...
<31998 more elements>
Hi! Since I was not able to make the extendTrailers function work, I tried to implement your approach manually. I created a GRangesList for each transcript without an annotated 3' UTR region. As the start position I used the ORF end position, and as the end position I used the end position of the transcript, as you mentioned before.
My list now looks like this
> three_utrs_granges
GRangesList object of length 9209:
$ENST00000158526
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] X 154450099 +
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
$ENST00000216019
GRanges object with 9 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 22 38494939-38506294 -
[2] 22 38494930-38506294 -
[3] 22 38494784-38506294 -
[4] 22 38494652-38506294 -
[5] 22 38494118-38506294 -
[6] 22 38494106-38506294 -
[7] 22 38487918-38506294 -
[8] 22 38486390-38506294 -
[9] 22 38486380-38506294 -
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
$ENST00000216407
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 12 68474718-68474902 +
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
...
<9206 more elements>
I ran RRS using these 3' UTR regions and I got results. I wanted to check: for transcripts where I have multiple regions because multiple ORFs are associated with them (example: ENST00000216019), can I take the smallest range as the 3' UTR to make sure I don't overlap with any ORF?
Thank you so much!!
Yes, good. This should work.
As a note extendTrailers was the wrong function, it was: stopRegion(orfs, tx, upstream = 0, downstream = Inf)
You could test that to validate it gives the same result.
Let me know how it goes and then I will close the issue :)
Problem
I am currently working on calculating the ribosome release score for a set of ORFs (Open Reading Frames) using Riboseq reads. To make the runtime more efficient, I have been filtering the BAM files and the list of ORFs to only include the regions of interest. However, when I tested this approach using a subset of chromosome 1 and compared the results to using the whole BAM file, the results were not consistent.
Background
The ribosome release score calculation involves processing large amounts of data, which can be computationally intensive. To speed up the process, I decided to work with only the relevant regions to reduce the workload. However, I encountered discrepancies in the results between the subset and the whole dataset.
Questions
I'm seeking advice on two main points:
Improving Runtime: I'm looking for ideas on how to optimize the runtime of the ribosome release score calculation without necessarily subsetting the data. Are there any strategies or tools that can help streamline the computation while working with the complete dataset?
Considerations for Subsetting: When working with subsets of the data, what are the key considerations to ensure that the results remain consistent with the full dataset? Are there any best practices or common pitfalls to be aware of when subsetting data for computational analysis?
Additional Information
I'm using Riboseq reads data. I have already attempted to compare the results between using a subset of chromosome 1 and the whole dataset for specific ORFs. The goal is to ensure that the runtime is optimized while maintaining the integrity and accuracy of the results.