Roleren / ORFik

MIT License
32 stars 9 forks source link

RRS Score using subset bam file #150

Closed josefinaperalba closed 11 months ago

josefinaperalba commented 1 year ago

Problem I am currently working on calculating the ribosome release score for a set of ORFs (Open Reading Frames) using Riboseq reads. To make the runtime more efficient, I have been filtering the BAM files and the list of ORFs to only include the regions of interest. However, when I tested this approach using a subset of chromosome 1 and compared the results to using the whole BAM file, the results were not consistent.

Background The ribosome release score calculation involves processing large amounts of data, which can be computationally intensive. To speed up the process, I decided to work with only the relevant regions to reduce the workload. However, I encountered discrepancies in the results between the subset and the whole dataset.

Questions I'm seeking advice on two main points:

Improving Runtime: I'm looking for ideas on how to optimize the runtime of the ribosome release score calculation without necessarily subsetting the data. Are there any strategies or tools that can help streamline the computation while working with the complete dataset?

Considerations for Subsetting: When working with subsets of the data, what are the key considerations to ensure that the results remain consistent with the full dataset? Are there any best practices or common pitfalls to be aware of when subsetting data for computational analysis?

Additional Information I'm using Riboseq reads data. I have already attempted to compare the results between using a subset of chromosome 1 and the whole dataset for specific ORFs. The goal is to ensure that the runtime is optimized while maintaining the integrity and accuracy of the results.

Roleren commented 1 year ago

Hey, Ill send what you need tomorrow morning

josefinaperalba commented 1 year ago

Have you also ran an analysis to do QC on the results that you can share?

Roleren commented 1 year ago

So, ORFik is highly optimized for speed, there is no point in subsetting bam files. That is why we have made a format called "ofst", based on the fst format from facebook. To convert bam files to ofst, see tutorial linked bellow (ofst will load entire file in < 1 second, compared to bam)

A note is you only run feature selection like RRS on p-site shifted reads! Not on the bam files.

So, First create ORFik.experiment

Then run:

Here is full tutorial script for ribo-seq: https://bioconductor.org/packages/release/bioc/vignettes/ORFik/inst/doc/Ribo-seq_pipeline.html If you have already aligned to bam files, start from this line: # Create experiment (Starting point if alignment is finished)

For section where I get RRS scores, I do this using the ORFik feature wrapping function called computeFeatures, which is found in tutorial in section: # Feature table (From WT rep 1)

I you have any questions, just ask

josefinaperalba commented 1 year ago

Hi, my objective is calculate the RRS score for different orfs that are already detected by another pipeline.

Until now I've been calculating the RRS with the script that i leave below. I realize now that I missing the pshifting section for the bam files, is that mandatory? is my script okay and I can save my memory requirement issue by changing from bam to ofst? Thanks!

Script:

#!/usr/bin/env Rscript

# Check if required arguments are provided
if (is.null(opt$samples) || is.null(opt$orfs) || is.null(opt$output) || is.null(opt$gtf)) {
  stop("Required arguments are missing", call. = FALSE)
}

# Load required libraries with suppressed messages
suppressMessages(library(ORFik))
suppressMessages(library(GenomicAlignments))
suppressMessages(library(tidyr))
suppressMessages(library(parallel))

# Read the sample sheet
sample_sheet <- read.csv(opt$samples, header = TRUE)

# Load ORFs summary
orf_summary <- read.csv(opt$orfs, sep = "\t")
orf_summary$samples <- strsplit(paste(orf_summary$ribocode_hit_samples, orf_summary$ribotricer_hit_samples, sep = ","), ",")
orf_summary$start <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][1])
orf_summary$end <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][2])

# Load reference GTF
reference_gtf <- read.csv(opt$gtf, sep = "\t", comment.char = "#", header = FALSE, col.names = c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute"))

# Function to compute RRS for a single sample
compute_RRS <- function(sample_name, bam_file, gtf_file, orf_summary) {
    local_reference_gtf <- reference_gtf

    # Filter for the current sample
    sample_orf_summary <- orf_summary[sapply(orf_summary$samples, FUN = function(x) is.element(sample_name, x)), ]

    # Load Ribo-Seq reads
    riboseq_reads <- readGAlignments(bam_file, index = paste0(bam_file, ".bai"), use.names = TRUE)

    # Extract 3' UTRs from reference GTF
    three_utrs <- local_reference_gtf[local_reference_gtf$feature == "three_prime_utr", ]
    three_utrs$transcript_id <- lapply(three_utrs$attribute, function(x) sub(";.*", "", sub(".*transcript_id ", "", x)))

    # Filter out chromosomes that are not present in all datasets
    chromosomes <- Reduce(intersect, list(sample_orf_summary$chrom, unique(seqnames(riboseq_reads)), three_utrs$seqname))
    sample_orf_summary <- sample_orf_summary[sample_orf_summary$chrom %in% chromosomes, ]
    riboseq_reads <- riboseq_reads[seqnames(riboseq_reads) %in% chromosomes]
    three_utrs <- three_utrs[three_utrs$seqname %in% chromosomes, ]

    # Match ORFs and 3' UTRs
    sample_orf_summary$granges <- lapply(1:nrow(sample_orf_summary), function(i) {
       GRanges(seqnames = sample_orf_summary[i, "chrom"],
               ranges = IRanges(start = as.numeric(sample_orf_summary[i, "start"]),
                                 end = as.numeric(sample_orf_summary[i, "end"])),
               strand = sample_orf_summary[i, "strand"])
    })    

    three_utrs$granges <- lapply(1:nrow(three_utrs), function(i) {
       GRanges(seqnames = three_utrs[i, "seqname"],
               ranges = IRanges(start = as.numeric(three_utrs[i, "start"]),
                                 end = as.numeric(three_utrs[i, "end"])),
               strand = three_utrs[i, "strand"])
    })

    orfs <- separate_rows(sample_orf_summary, "transcript_id", "transcript_type", sep = ",")
    merged <- merge(x = orfs, y = three_utrs, by.x = "transcript_id", by.y = "transcript_id")

    # Create ORF GRangesList
    grl <- GRangesList(merged$granges.x)
    names(grl) <- merged$transcript_id

    # Create 3' UTR GRangesList
    three_utrs_granges <- GRangesList(merged$granges.y)
    names(three_utrs_granges) <- merged$transcript_id

    # Compute RRS
    rrs <- ribosomeReleaseScore(grl = grl, RFP = riboseq_reads, GtfOrThreeUtrs = three_utrs_granges)

    # Return a data frame with ORF ID and RRS score
    result <- data.frame(orf_id = merged$ORF_ID, rrs = rrs, sample_name = sample_name)
    result_maxrrs<- setNames(aggregate(result$rrs, by = list(result$orf_id), FUN = max), c("orf_id", "max_rrs"))
    result <- merge(result, result_maxrrs, by = "orf_id", all.x = TRUE)
    result <- result[result$rrs == result$max_rrs, ]
    result$row_num <- ave(result$rrs, result$orf_id, FUN = seq_along)
    result <- result[result$row_num == 1, ]
    result$max_rrs <- NULL
    result$row_num <- NULL
    return(result)
}

# Initialize a cluster for parallel processing
cl <- makeCluster(opt$cores)
clusterExport(cl, c("GRanges", "IRanges"))
sample_sheet_list <- split(sample_sheet, seq(nrow(sample_sheet)))

# Create a function to compute RRS in parallel
compute_RRS_single_row <- function(row) {
    compute_RRS(row$sample, row$bam, reference_gtf, orf_summary)
}

# Use mclapply to compute RRS scores in parallel
result_list <- mclapply(sample_sheet_list, compute_RRS_single_row, mc.cores = opt$cores)

# Close the cluster
stopCluster(cl)

# Combine RRS results into a single data frame
combined_results <- do.call(rbind, result_list)

# Write the combined RRS results to the output file
write.table(combined_results, opt$output, row.names = FALSE, sep = "\t")

cat("All done!\n")
Roleren commented 1 year ago

Yes, I would say the pshifting is mandatory (your results will be a bit inaccurate without it). Since bam files have ~ 30nt fragments, the RRS score will hit an orf as long as the fragment hits. While with shifting, it only hits if the p-site is on the orf. Also for memory etc, ofst is much more efficient.

Which ORF prediction algorithm did you use if you do not have p-shifted fragments may I ask ?

Why do you load you gtf like that ? If you have a proper gtf, see this tutorial for how to load 3' UTRs etc in ORFik: https://bioconductor.org/packages/release/bioc/vignettes/ORFik/inst/doc/Importing_Data.html

Some notes:

I advice you to read the ORFik tutorials and making ofst files, and send me questions you have here. As I believe your current method will contain errors and strange results :)

If this is a one time analysis which is not to be published, then I would think your current script is "nearly ok".

josefinaperalba commented 1 year ago

I want to calculate the scores at an orf level. I have a tsv file with all the information neccesary to create a Granges of orfs. How can include that in the experiment if i cannot use that directly into the computeFeatures function?

Roleren commented 1 year ago

So first the error you sent: Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand): Seqlengths of x contains NA values! To fix this, please update the gtf to a proper formated txdb (it complains on your gtf definition), with this:

gtf_file <- "/opt/projects/1357_BIP/2022-03_Riboseq_pipeline/data/references/annotations/103_from_BI/Homo_sapiens.GRCh38.103.gtf"
makeTxdbFromGenome(gtf_file, genome_file, "Homo sapiens", optimize = TRUE)
# Then update new path for experiment
gtf_file <- "/opt/projects/1357_BIP/2022-03_Riboseq_pipeline/data/references/annotations/103_from_BI/Homo_sapiens.GRCh38.103.gtf.db"
# Now create experiment again with valid txdb:
create.experiment(
bam_dir,
exper = experiment_name,
fa = genome_file,
txdb = gtf_file,
organism = organism,
rep = replicates,
condition = conditions
)
# Finally re-run
ORFikQC(df, create.ofst = FALSE) # create.ofst = FALSE, To avoid loading bam files again, you already succeded in making ofst

Second question:

For compute features you need a grangeslist, so first you need to do is read in the tsv and convert to GRangesList.

so something like:

df <- read.experiment("GSE")
orfs_path <- "path/to/tsv"
orfs <- rtracklayer::import(orfs_path) # <- this usually loads in from tsv, but if not let me know, that means you need a custom converter, which I can help you with.

computeFeatures(orfs, RFP = rfp_lib, RNA = NULL, Gtf = df@txdb, faFile = df@fafile, weight.RFP = score(rfp_lib))
josefinaperalba commented 1 year ago

Perfect! Thanks you so much for your help. I have a doubt, this function handles automatically the matching between my orfs and the 3' UTR regions in the reference file for RRS calculation? What happens if an orf doesn't match to a 3' UTR region?

Roleren commented 1 year ago

Ah, yes, forgot to mention that.

So the orfs must be named txname_id

So like first orf of transcript named ENST0000015123, can be named: "ENST0000015123_1" etc. Then it will automatically match.

You can also call them all the tx id, without _1, _2 etc :)

Roleren commented 1 year ago

Also if it does not work, you can try to load the tsv, create a GRanges Object from scratch and split on ORF id:

Pseudo code:

orfs <- data.table::fread("tsv_file")
orfs <- GRanges(seqnames = orfs$chromosome, start = orfs$start, end = orfs$end, strand = orfs$strand, tx_id = orfs$tx_id, orf_id = orfs$orf_id)
# If you do not have proper orf_id make it like this:
orf_id <- paste(orfs$tx_id, as.numeric(as.factor(orfs$orf_id))), sep = "_") # Each orf gets a number, such that all exons of same orf have the same number
orfs <- split(orfs, orfs$orf_id) # orfs_id must be made from (tx_id)_(orf_number), then it will work.
josefinaperalba commented 1 year ago

Hi thanks so much for your help. Until now with your help i was able to calculate the RRS score using the ribosomeRealeaseScore function but I'm not able to implement the compute features function since I get this error. computeFeatures(orfs, RFP = fimport(filepath(df.rfp[6,], "pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile, weight.RFP = "score") No RNA added, skipping feature te and fpkm of RNA, also ribosomeReleaseScore will also be not normalized best way possible. Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand) : Seqlengths of x contains NA values!

Thanks!!

josefinaperalba commented 1 year ago

On another hand, by any chance there are any plans to incorporate the RAI score from this paper 10.1186/s12864-018-4765-z to the library??

Thank you!

Roleren commented 1 year ago

Yes, I need to make that error more clear I see.

So what it is you did not add the seqinfo to the "orfs" object.

Simply do this:

seqinfo(orfs) <- seqinfo(df.rfp) #df.rfp is the ORFik.experiment

Now rerun computeFeatures and it should work.

Roleren commented 1 year ago

BTW, all subfunctions required for RAI is already in ORFik.

It is quite trivial to remake, but I can add it on the wish list as a function implemented.

If you wanted to make a mock for yourself, just define RAI: x_i (count of ORF > 10 for sample i), y_i (translation status: use FLOSS, RRS, and ORFscore, set cutoff from cds 90% included for sample i) / x_i

No magic.

josefinaperalba commented 1 year ago

hi, thanks again for all your responses!

I've tried adding the seqinfo.

This is what I did

organism <- "Homo sapiens" 
paired_end <- FALSE  # Set to TRUE if your data is paired-end
replicates <- c(1, 2, 3, 1, 2, 3)
conditions <- rep(c("CTR", "PM25"), each = 3)
conf <- config.exper(experiment = "GSE",
                     assembly = "Homo_sapiens_GRCh38_103",
                     type = "Ribo-seq")

# Create the experiment
create.experiment(
  bam_dir,
  exper = experiment_name,
  fa = genome_file,
  txdb = gtf_file,
  organism = organism,
  rep = replicates,
  condition = conditions
)

df.rfp <- read.experiment("GSE")

ORFikQC(df.rfp, complex.correlation.plots = FALSE)

shiftFootprintsByExperiment(df.rfp)
orf_summary <- read.csv('/home/zs-ans/rrs/orf_summary.tsv', sep = "\t", nrows=5)
orf_summary$samples <- strsplit(paste(orf_summary$ribocode_hit_samples, orf_summary$ribotricer_hit_samples, sep = ","), ",")
orf_summary$start <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][1])
orf_summary$end <- lapply(orf_summary$location, function(x) strsplit(strsplit(x, ":")[[1]][2], "-")[[1]][2])
orf_summary <- separate_rows(orf_summary, "transcript_id", "transcript_type", sep = ",")
sample_name<-'PM25_3'
orf_summary <- orf_summary[sapply(orf_summary$samples, FUN = function(x) is.element(sample_name, x)), ]
orf_id <- paste(orf_summary$transcript_id, as.numeric(as.factor(orf_summary$ORF_ID)), sep = "_") # Each orf gets a number, such that all exons of same orf have the same number
orf_summary$ORF_ID<-orf_id
orfs<-makeGRangesListFromDataFrame(orf_summary, seqnames.field= 'chrom', split.field = 'ORF_ID')
chromosomes <- intersect(seqinfo(orfs),seqinfo(df.rfp))
seqinfo(orfs) <- chromosomes
dt <- computeFeatures(orfs,
                RFP = fimport(filepath(df.rfp[6,], "pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile,
                weight.RFP = "score")

This is the error I'm facing now using the computeFeatures function but it didn't happened in RRS calculation

Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference

Thanks!

Roleren commented 1 year ago

Been sick a few days, so could not respond before now.

Hm, you have a annotation to seqinfo mismatch (I.e. a sequence that goes outside the chromosome boundary / or transcript boundary)

So I have an idea of what it could be, did you create the ORF list using a different genome/ annotation file?

  1. So that the coordinates are wrong ?
  2. Are you using the ensemble v103 gtf/fasta genome for human btw?
  3. Can you send full output of seqinfo(orfs) and seqinfo(df.rfp) and the intersect ?
josefinaperalba commented 1 year ago

Hi! thanks so much for your response I've been analyzing the package and its awesome! Also I hope you are feeling better!

1) I've rerun my orf detection to make sure I was using the same files and I am. 2) Yes, I am using the Homo_sapiens.GRCh38.dna.primary_assembly.fa file as the human genome file. 3) First I get the seqinfo of orfs and df.rfp for the last sample. Then i run the intersect. Once i calculate the intersect i replace the seqinfo of orfs.

  1. seqinfo(orfs): seqinfo(orfs) Seqinfo object with 26 sequences from an unspecified genome; no seqlengths: seqlengths isCircular genome 1 NA NA 2 NA NA 3 NA NA 4 NA NA 5 NA NA 6 NA NA 7 NA NA 8 NA NA 9 NA NA 10 NA NA 11 NA NA 12 NA NA 13 NA NA 14 NA NA 15 NA NA 16 NA NA 17 NA NA 18 NA NA 19 NA NA 20 NA NA 21 NA NA 22 NA NA X NA NA Y NA NA MT NA NA KI270728.1 NA NA 2.seqinfo(df.rfp[6,]) Seqinfo object with 194 sequences from an unspecified genome: seqlengths isCircular genome 1 248956422 NA 10 133797422 NA 11 135086622 NA 12 133275309 NA 13 114364328 NA 14 107043718 NA 15 101991189 NA 16 90338345 NA 17 83257441 NA 18 80373285 NA 19 58617616 NA 2 242193529 NA 20 64444167 NA 21 46709983 NA 22 50818468 NA 3 198295559 NA 4 190214555 NA 5 181538259 NA 6 170805979 NA 7 159345973 NA 8 145138636 NA 9 138394717 NA MT 16569 NA X 156040895 NA Y 57227415 NA KI270728.1 1872759 NA KI270727.1 448248 NA KI270442.1 392061 NA KI270729.1 280839 NA GL000225.1 211173 NA KI270743.1 210658 NA GL000008.2 209709 NA GL000009.2 201709 NA KI270747.1 198735 NA KI270722.1 194050 NA GL000194.1 191469 NA KI270742.1 186739 NA GL000205.2 185591 NA GL000195.1 182896 NA KI270736.1 181920 NA KI270733.1 179772 NA GL000224.1 179693 NA GL000219.1 179198 NA KI270719.1 176845 NA GL000216.2 176608 NA KI270712.1 176043 NA KI270706.1 175055 NA KI270725.1 172810 NA KI270744.1 168472 NA KI270734.1 165050 NA GL000213.1 164239 NA GL000220.1 161802 NA KI270715.1 161471 NA GL000218.1 161147 NA KI270749.1 158759 NA KI270741.1 157432 NA GL000221.1 155397 NA KI270716.1 153799 NA KI270731.1 150754 NA KI270751.1 150742 NA KI270750.1 148850 NA KI270519.1 138126 NA GL000214.1 137718 NA KI270708.1 127682 NA KI270730.1 112551 NA KI270438.1 112505 NA KI270737.1 103838 NA KI270721.1 100316 NA KI270738.1 99375 NA KI270748.1 93321 NA KI270435.1 92983 NA GL000208.1 92689 NA KI270538.1 91309 NA KI270756.1 79590 NA KI270739.1 73985 NA KI270757.1 71251 NA KI270709.1 66860 NA KI270746.1 66486 NA KI270753.1 62944 NA KI270589.1 44474 NA KI270726.1 43739 NA KI270735.1 42811 NA KI270711.1 42210 NA KI270745.1 41891 NA KI270714.1 41717 NA KI270732.1 41543 NA KI270713.1 40745 NA KI270754.1 40191 NA KI270710.1 40176 NA KI270717.1 40062 NA KI270724.1 39555 NA KI270720.1 39050 NA KI270723.1 38115 NA KI270718.1 38054 NA KI270317.1 37690 NA KI270740.1 37240 NA KI270755.1 36723 NA KI270707.1 32032 NA KI270579.1 31033 NA KI270752.1 27745 NA KI270512.1 22689 NA KI270322.1 21476 NA GL000226.1 15008 NA KI270311.1 12399 NA KI270366.1 8320 NA KI270511.1 8127 NA KI270448.1 7992 NA KI270521.1 7642 NA KI270581.1 7046 NA KI270582.1 6504 NA KI270515.1 6361 NA KI270588.1 6158 NA KI270591.1 5796 NA KI270522.1 5674 NA KI270507.1 5353 NA KI270590.1 4685 NA KI270584.1 4513 NA KI270320.1 4416 NA KI270382.1 4215 NA KI270468.1 4055 NA KI270467.1 3920 NA KI270362.1 3530 NA KI270517.1 3253 NA KI270593.1 3041 NA KI270528.1 2983 NA KI270587.1 2969 NA KI270364.1 2855 NA KI270371.1 2805 NA KI270333.1 2699 NA KI270374.1 2656 NA KI270411.1 2646 NA KI270414.1 2489 NA KI270510.1 2415 NA KI270390.1 2387 NA KI270375.1 2378 NA KI270420.1 2321 NA KI270509.1 2318 NA KI270315.1 2276 NA KI270302.1 2274 NA KI270518.1 2186 NA KI270530.1 2168 NA KI270304.1 2165 NA KI270418.1 2145 NA KI270424.1 2140 NA KI270417.1 2043 NA KI270508.1 1951 NA KI270303.1 1942 NA KI270381.1 1930 NA KI270529.1 1899 NA KI270425.1 1884 NA KI270396.1 1880 NA KI270363.1 1803 NA KI270386.1 1788 NA KI270465.1 1774 NA KI270383.1 1750 NA KI270384.1 1658 NA KI270330.1 1652 NA KI270372.1 1650 NA KI270548.1 1599 NA KI270580.1 1553 NA KI270387.1 1537 NA KI270391.1 1484 NA KI270305.1 1472 NA KI270373.1 1451 NA KI270422.1 1445 NA KI270316.1 1444 NA KI270340.1 1428 NA KI270338.1 1428 NA KI270583.1 1400 NA KI270334.1 1368 NA KI270429.1 1361 NA KI270393.1 1308 NA KI270516.1 1300 NA KI270389.1 1298 NA KI270466.1 1233 NA KI270388.1 1216 NA KI270544.1 1202 NA KI270310.1 1201 NA KI270412.1 1179 NA KI270395.1 1143 NA KI270376.1 1136 NA KI270337.1 1121 NA KI270335.1 1048 NA KI270378.1 1048 NA KI270379.1 1045 NA KI270329.1 1040 NA KI270419.1 1029 NA KI270336.1 1026 NA KI270312.1 998 NA KI270539.1 993 NA KI270385.1 990 NA KI270423.1 981 NA KI270392.1 971 NA KI270394.1 970 NA 3.intersect: intersect Seqinfo object with 26 sequences from an unspecified genome: seqlengths isCircular genome 1 248956422 NA 2 242193529 NA 3 198295559 NA 4 190214555 NA 5 181538259 NA 6 170805979 NA 7 159345973 NA 8 145138636 NA 9 138394717 NA 10 133797422 NA 11 135086622 NA 12 133275309 NA 13 114364328 NA 14 107043718 NA 15 101991189 NA 16 90338345 NA 17 83257441 NA 18 80373285 NA 19 58617616 NA 20 64444167 NA 21 46709983 NA 22 50818468 NA X 156040895 NA Y 57227415 NA MT 16569 NA KI270728.1 1872759 NA

    4.seqlevels(orfs, pruning.mode = "coarse") <- seqlevels(intersect) seqinfo(orfs) <- intersect

    Again thank you so much for your help! This is tremendous help for me!

Roleren commented 1 year ago

Ok, then I think it is the gtf or orfs,

The problem you have a range trying to map outside the chromosome boundary, so something has the wrong ranges, that is what we need to figure out, what is it.

  1. Send me the name for the gtf too (you only sent fasta genome name), is this from ensemble too?
  2. How did you create the list of ORFs (from another genome, maybe one of the orfs then have ranges outside the ensemble definition?) Check that max coordinate of orf is never outside the seqname max, also an easy way is to run translate(ORFik:::txSeqsFromFa(orfs, df@fafile, TRUE, TRUE)), and see that all the orfs start with start codon and end with stop codon.
josefinaperalba commented 1 year ago

Hi, thanks for your response!

  1. this is the GTF I'm using Homo_sapiens.GRCh38.103.gtf
  2. I used this reference to create th list of orfs by using ribotricer (https://github.com/smithlabcode/ribotricer) and ribocode (https://github.com/xryanglab/RiboCode). I've checked the end position of each of my orfs and the end position of the corresponding transcript and in only one case the positions are the same. in the other ones the coordinates from the orfs are smaller. At last, I ran the command you gave and this is the result with warnings

translate(ORFik:::txSeqsFromFa(orfs, df.rfp@fafile, TRUE, TRUE)) AAStringSet object of length 13: width seq names
[1] 2718 MTPYEGKDSVLRRRTPGGFYILS...LLGLQPLLTRSSTC*GASRR ENST00000347370_4 [2] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000349431_5 [3] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000360466_5 [4] 2718 MTPYEGKDSVLRRRTPGGFYILS...L*LGLQPLLTRSSTCGASRR ENST00000400929_4 [5] 4262 MSSTSSKRAPTTATQRLKQDYLR...CDSWVCSLCLHGQVRAEEHRAG ENST00000400930_6 ... ... ... [9] 2718 MTPYEGKDSVLRRRTPGGFYILS...LLGLQPLLTRSSTCGASRR ENST00000450390_4 [10] 4249 MPEIRVTPLGEWEPPGGLWGGLR...DEELGSFLTSLLKKGLPQAPS ENST00000618806_1 [11] 143 MLAGNEFQVSLSSSMSVSELKAQ...PLEDQLPLGEYGLKPLSTVFMN ENST00000624652_9 [12] 157 MLAGNEFQVSLSSSMSVSELKAQ...PLSTVFMNLRLRGGGTEPGGRS ENST00000624697_10 [13] 300 MVRQMSQVGGGGLCASQFSSPSP...SPAPCSICACGEAAQSLAGG ENST00000649529_8 Warning messages: 1: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[1]]': last 2 bases were ignored 2: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[2]]': last base was ignored 3: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[3]]': last base was ignored 4: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[4]]': last 2 bases were ignored 5: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[5]]': last base was ignored 6: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[8]]': last base was ignored 7: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[9]]': last 2 bases were ignored 8: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[11]]': last 2 bases were ignored 9: In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : in 'x[[13]]': last 2 bases were ignored

Again thank you so much for all your help!

Roleren commented 1 year ago

Can you try to run without the orf that hits the flank ? Also from translate call it gives warning, your orfs are not made of triplets, also you see none of them end with a stop codon, they were defined from start and up to only first nucleotide of second last codon ? Looks strange to me

josefinaperalba commented 1 year ago

Hi, after investigating the data I found two cases that i think are interesting.

This is an example of an orf where i can run the compute features function: orf id: ENST00000624697_1 transcript_id: ENST00000624697 start:1014005 end: 1014475

Reference: transcript_id: ENST00000624697 feature start end transcript 1001138 1014540 exon 1001138 1001281 exon 1008194 1008279 exon 1013984 1014540 CDS 1014005 1014475 start_codon 1014005 1014007 stop_codon 1014476 1014478 five_prime_utr 1001138 1001281 five_prime_utr 1008194 1008279 five_prime_utr 1013984 1014004 three_prime_utr 1014479 1014540

Also on this i checked the width that is the part I'm getting the error and the values are [["ENST00000624697_1"]] 471 [["ENST00000624697"]] 144 86 557

Now here is an example of an orf that gives me this error: Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference orf id: ENST00000450390_1 transcript_id: ENST00000450390 start:1255206 end: 1263361

Reference transcript_id: ENST00000450390 feature start end transcript 1253909 1273853 exon 1273666 1273853 exon 1267862 1267992 CDS 1267864 1267992 start_codon 1267990 1267992 exon 1266098 1266290 stop_codon 1267862 1267863 stop_codon 1266290 1266290 exon 1263346 1263386 exon 1257208 1257310 exon 1256992 1257130 exon 1256045 1256125 exon 1253909 1255487 five_prime_utr 1273666 1273853 three_prime_utr 1266098 1266289 three_prime_utr 1263346 1263386 three_prime_utr 1257208 1257310 three_prime_utr 1256992 1257130 three_prime_utr 1256045 1256125 three_prime_utr 1253909 1255487

width:

[["ENST00000450390_1"]] 8156 [["ENST00000450390"]] 188 131 193 41 103 139 81 1579

Thank you so much!!

Roleren commented 1 year ago

Yeah, there you have it. The last ORF (ENST00000450390_1) is over 8k long, while the full transcript is only 2k long. Which is not possible.. Where did you get names from? Maybe it is a different isoform which is correct ?

josefinaperalba commented 1 year ago

Hi! Thanks for your response! After carefully analyzing the data, I've found that the coordinates i was using contained introns and were not just the exonic regions of the orf.

Now I have changed that and my orfs Granges looks like this

orfs
GRangesList object of length 67654:
$ENST00000000233_1
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]        7 127589083-127589163      +
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

$ENST00000000233_2
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]        7 127589485-127589594      +
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

$ENST00000000233_3
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]        7 127590066-127590137      +
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

...
<67651 more elements>
> seqinfo(orfs)
Seqinfo object with 26 sequences from an unspecified genome; no seqlengths:
  seqnames   seqlengths isCircular genome
  1                <NA>       <NA>   <NA>
  2                <NA>       <NA>   <NA>
  3                <NA>       <NA>   <NA>
  4                <NA>       <NA>   <NA>
  5                <NA>       <NA>   <NA>
  ...               ...        ...    ...
  22               <NA>       <NA>   <NA>
  X                <NA>       <NA>   <NA>
  Y                <NA>       <NA>   <NA>
  MT               <NA>       <NA>   <NA>
  KI270728.1       <NA>       <NA>   <NA>

this is the seqinfo information of my aligment files

seqinfo(df.rfp[6,])
Seqinfo object with 194 sequences from an unspecified genome:
  seqnames   seqlengths isCircular genome
  1           248956422       <NA>   <NA>
  10          133797422       <NA>   <NA>
  11          135086622       <NA>   <NA>
  12          133275309       <NA>   <NA>
  13          114364328       <NA>   <NA>
  ...               ...        ...    ...
  KI270539.1        993       <NA>   <NA>
  KI270385.1        990       <NA>   <NA>
  KI270423.1        981       <NA>   <NA>
  KI270392.1        971       <NA>   <NA>
  KI270394.1        970       <NA>   <NA>  

I have tried several ways of intersecting them but i keep getting this error

  dt <- computeFeatures(orfs,
+                 RFP = fimport(filepath(df.rfp[6,],"pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile,
+                 weight.RFP = "score")
No RNA added, skipping feature te and fpkm of RNA, also ribosomeReleaseScore will also be not normalized best way possible.
Error: subscript contains invalid names

What would be the problem now and what is the best way of intersecting?

Roleren commented 1 year ago

Just to make sure first, could you run this: any(lengths(orfs) > 1) # Test to see there is an orf with 2 exons, if not it is most likely still wrong.

Then for your error: so "subscript contains invalid names" means that it tries a [] subsetting, but could not find that name. To replicate try this: library(GenomicRanges); a <- GRangesList(a = GRanges("1", 1)); a["b"] # "b" is not a name in GRL

My guess is that you have an orf that has a transcript name which is not in your annotation etc.

To speed up finding the bug, I can show you how to debug properly.

First run this:

# <- Define all variables needed
debug(computeFeatures)
dt <- computeFeatures(RFP = fimport(filepath(df.rfp[6,],"pshifted")), Gtf = df.rfp@txdb, faFile = df.rfp@fafile, weight.RFP = "score")
# Now in debug mode in console press n + enter for next 
# and s + enter to step into a function, the first round figure out which sub function inside computeFeatures fails, then you can run:
undebug(computeFeatures) # Remove debug flag for computeFeatures
debug(function_that_fails) # And now run computeFeatures again

I am quite sure you will find which line fails and why from that :)

josefinaperalba commented 1 year ago

Hi, I have found the problem, the debugging instruction were really helpful!. It was that i have some orfs that do not have 5' UTR region in the reference. I proceeded to delete those but now I'm back at this error only when enabling uorfFeatures option: Error in pmapToTranscriptF(grl, reference, ignore.strand = ignore.strand, : Invalid ranges to map, check them. One has width bigger than its reference.

My reference data looks like this for a transcript id

Captura de pantalla 2023-11-14 210110

and my orf table looks like this for the same transcript_id

Captura de pantalla 2023-11-14 210735

Now, I've also analyzed the widths of those objects to se from where the error is comming. as you can see now i dont have orfs whose exonic regions are bigger than its reference but i think that there is a problem with the mapping.

xWidths IntegerList of length 65542 [["ENST00000000233_1"]] 81 [["ENST00000000233_2"]] 110 [["ENST00000000233_3"]] 72 [["ENST00000000233_4"]] 126 [["ENST00000000233_5"]] 84 [["ENST00000000412_1"]] 176 [["ENST00000000412_2"]] 167 [["ENST00000000412_3"]] 110 [["ENST00000000412_4"]] 131 [["ENST00000000412_5"]] 127 ...

<65532 more elements> > txWidths IntegerList of length 65542 [["ENST00000000233"]] 155 81 110 72 126 488 [["ENST00000000233"]] 155 81 110 72 126 488 [["ENST00000000233"]] 155 81 110 72 126 488 [["ENST00000000233"]] 155 81 110 72 126 488 [["ENST00000000233"]] 155 81 110 72 126 488 [["ENST00000000412"]] 158 177 167 110 131 127 1580 [["ENST00000000412"]] 158 177 167 110 131 127 1580 [["ENST00000000412"]] 158 177 167 110 131 127 1580 [["ENST00000000412"]] 158 177 167 110 131 127 1580 [["ENST00000000412"]] 158 177 167 110 131 127 1580 ... <65532 more elements> > xWidths[xWidths>=txWidths] IntegerList of length 65542 [["ENST00000000233_1"]] integer(0) [["ENST00000000233_2"]] integer(0) [["ENST00000000233_3"]] integer(0) [["ENST00000000233_4"]] integer(0) [["ENST00000000233_5"]] integer(0) [["ENST00000000412_1"]] 176 [["ENST00000000412_2"]] 167 [["ENST00000000412_3"]] integer(0) [["ENST00000000412_4"]] integer(0) [["ENST00000000412_5"]] integer(0) ... <65532 more elements> Again thank you so much!
Roleren commented 1 year ago

I see a big worry! I think you might not have any splicing information for the orfs ORFs.

So here is the problem: An ORF might span multiple exons, so if an ORF spans 2 exons it needs 2 GRanges for the object.

Like this:

> orfs <- GRangesList(ENST0000051251_1 = GRanges("1", IRanges(c(1, 5), width = 3), "+"))
> orfs
GRangesList object of length 1:
$ENST0000051251_1
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-3      +
  [2]        1       5-7      +

You see this ORFs spans 2 exons (in this case the intron is 1 base at position 4).

RiboCode only gives transcript coordinates at column "ORF_tstart" and genomic at ORF_gstart

If you define ORFs directly from ORF_gstart you don't have the splicing information.

So the proper way is to take the "ORF_tstart" and map to genomic For full example, run this:

library(ORFik)
mrna_genomic <- GRangesList(ENST0000051251 = GRanges("1", IRanges(c(1, 20), width = 9), "+")) # The mRNAs, Intron at 10-19, for you load through this: mrna_genomic <- loadRegion(df_rfp, "tx")
orfs_starts <- c(4, 7) # tx_coordinate: in RiboCode: ORF_tstart
orf_widths <- c(6, 6) # 6 nt long ORFs, this is ORF_tstop - ORF_tstart + 1 (6 - 1 + 1 = 6)
orf_tx_parrent <- rep("ENST0000051251", 2) # Both from same tx
orfs_tx <- IRanges(orfs_starts, width = orf_widths) # In tx coordinates (2 ORFs)
tx_orf_matching <- data.table::chmatch(orf_tx_parrent, names(mrna_genomic)) # Match orf to tx index
orfs_tx <- split(orfs_tx, tx_orf_matching)
all_indices_found <- all(!is.na(names(orfs_tx)))
stopifnot(all_indices_found) # Then you have an ORF which is not from an mRNA
# Now map to genomic coordinates with splicing and give correct names, txid_1 etc
ORFik:::mapToGRanges(mrna_genomic, orfs_tx, groupByTx = FALSE)

Did this give you what you needed?

Also: So RiboCode output has the column "- ORF_type", which for uORFs are "uORF" and "Overlap_uORF". Are you sure you subsetted to those when settings uorfFeatures = TRUE ? Since computeFeatures only works for uORFs then, not the other types (Since it then requires 5' UTRs, which might not exist)

From your error above (an ORF from a transcript without 5' UTR, it sounds like you have all ORF types in your list). If so, just subset to uORF types and run again, then it should work :) So you can split your set in 2, the uORF set, which you can run with uORF features, and all the others. Which you can run without uORF features.

josefinaperalba commented 1 year ago

perefect! Thank you so much! I got it and I'm finally able to run this function, now my data looks like this orfs GRangesList object of length 5946: $ENST00000000233_1 GRanges object with 5 ranges and 0 metadata columns: seqnames ranges strand

[1] 7 127589083-127589163 + [2] 7 127589485-127589594 + [3] 7 127590066-127590137 + [4] 7 127590963-127591088 + [5] 7 127591213-127591296 + ------- seqinfo: 26 sequences from an unspecified genome; no seqlengths $ENST00000000412_1 GRanges object with 6 ranges and 0 metadata columns: seqnames ranges strand [1] 12 8946229-8946404 - [2] 12 8945418-8945584 - [3] 12 8943801-8943910 - [4] 12 8943405-8943535 - [5] 12 8942416-8942542 - [6] 12 8941821-8941940 - ------- seqinfo: 26 sequences from an unspecified genome; no seqlengths $ENST00000001008_1 GRanges object with 8 ranges and 0 metadata columns: seqnames ranges strand [1] 12 2797767-2797871 + [2] 12 2798706-2798826 + [3] 12 2799088-2799244 + [4] 12 2799850-2799940 + [5] 12 2800039-2800122 + [6] 12 2800392-2800577 + [7] 12 2801117-2801356 + [8] 12 2803151-2803255 + ------- seqinfo: 26 sequences from an unspecified genome; no seqlengths ... <5943 more elements> I have only one more question. In my data there are some novel orfs which do not have corresponding three prime utr regions in the reference. I've seen that in those cases the scores are calculated except for the RRS where of course i get a NA due to not having matching 3 prime utr in the reference. Do you have any function in the library that i can use for those cases? Again thank you so much, you are being extremely helpful and I'm super excited about this results!
Roleren commented 1 year ago

Sure, I usually make these pseudo scores like this:

res <- computeFeatures(orfs, …) # Your normal calculation
no_trailer <- is.na(res$RRS) # Detect et
orfs_with_no_trailers <- orfs[no_trailer]
cds_no_trailers <- cds[txNames(orfs_with_no_trailers)]
cds_stop_no_trailers <- stopCodons(cds_no_trailers, TRUE)
pseudo_trailers <- extendTrailers(cds_stop_no_trailers, extension = 300) # 300 nucleotide pseudo trailers
res_no_trailer <- ribosomeReleaseScore(orfs_with_no_trailers, ribo_seq, pseudo_trailers, weight.RFP = "score")
# Now merge in RRS values from  pseudo values
res[no_trailer, ]$RRS <- res_no_trailer
josefinaperalba commented 12 months ago

Hi, as always thank you so much for your response!! One question, for those orfs that I want to implement the extendTrailers function. Can I use the stop codons that are in the result from the computeFeatures function?

Roleren commented 12 months ago

Hm, both stopCodons and extendTrailers are existing functions in ORFik, no need to reimplement it.

This code will just run, as long as you have orfs and cds as GRangesList objects :)

josefinaperalba commented 12 months ago

I mean for those orfs i don't have the anotated CDS regions so I cannot run this commands

cds_no_trailers <- cds[txNames(orfs_with_no_trailers)] cds_stop_no_trailers <- stopCodons(cds_no_trailers, TRUE)

That's why I'm trying to find another way to extract the stop codons.

Thanks!

Roleren commented 12 months ago

If I understand you mean from "non coding transcripts". ? I.e. the transcript has no defined CDS ?

If so, the easiest way is to say anything downstream of orf is 3' utr. To do that just swap CDS with orfs for those. (Note: there is a corner case when 2 orfs are on the same non coding transcript, then the 3' utr of the first will overlap the downstream orf)

I.e. do: Stops <- stopCodons(orfs) trailers <- extendTrailers(stops, downstream =Inf) # extend infinite = to end of transcript.

Did that work ?

josefinaperalba commented 11 months ago

Hi! thanks for your response. I couldn't find the downstream argument in the extendTrailers function. I tried it with the extension argument set to Inf but then I'm getting this error:


novel <- result[result$name %in% dt[is.na(dt$RRS)]$names,]
orfs_novel<-makeGRangesListFromDataFrame(novel, seqnames.field= 'seqname', split.field = 'name')
stops <- stopCodons(orfs_novel)
trailers <- extendTrailers(stops, extension = Inf)
Error in recycleArg(arg, argname, length.out) : 'width' contains NAs
In addition: Warning message:
In recycleIntegerArg(width, "width", length(x)) :
  NAs introduced by coercion to integer range

   stops
GRangesList object of length 32001:
$ENST00000158526_1
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]        X 154450097-154450099      +
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000216019_1
GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       22 38492045-38492047      -
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000216019_2
GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       22 38492045-38492047      -
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

...
<31998 more elements>
josefinaperalba commented 11 months ago

Hi! since I was not able to make the extendTrailers function worked I tried to manually implement your approach. So I create a Granges list with each transcript whitout the annotated 3 prime utr region. As start position I used the orf end position and as end position I used the end position of the transcript as you mentioned before.

My list now looks like this

> three_utrs_granges
GRangesList object of length 9209:
$ENST00000158526
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        X 154450099      +
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000216019
GRanges object with 9 ranges and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       22 38494939-38506294      -
  [2]       22 38494930-38506294      -
  [3]       22 38494784-38506294      -
  [4]       22 38494652-38506294      -
  [5]       22 38494118-38506294      -
  [6]       22 38494106-38506294      -
  [7]       22 38487918-38506294      -
  [8]       22 38486390-38506294      -
  [9]       22 38486380-38506294      -
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

$ENST00000216407
GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]       12 68474718-68474902      +
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

...
<9206 more elements>

I ran RRS using three prime utrs regions and i got results. I wanted to check if the transcripts where i have multiple region due to having multiple orfs associated (example: ENST00000216019) I can take the samallest range as 3'UTR to make sure I dont overlap with any orf.

Thank you so much!!

Roleren commented 11 months ago

Yes, good. This should work.

As a note extendTrailers was the wrong function, it was: stopRegion(orfs, tx, upstream = 0, downstream = Inf)

You could test that to validate it gives the same result.

Let me know how it goes and then I will close the issue :)