Roleren / ORFik

MIT License
32 stars 9 forks source link

RAI Score using subset bam file #151

Closed gpattarone closed 8 months ago

gpattarone commented 9 months ago

Hey, I hope this message finds you well. I am currently working on calculating the Ribosome Association Index (RAI) for a set of ORFs using Riboseq reads. While delving into this process, I came across an explanation you provided in another open GitHub issue that seemed relevant to my work and use of ORFik library.

The specific formula you mentioned reads as follows:

_"If you wanted to make a mock for yourself, just define RAI: x_i (count of ORF > 10 for sample i), y_i (translation status: use FLOSS, RRS, and ORFscore, set cutoff from cds 90% included for sample i) / x_i."

I have a few queries regarding this mock:

x_i -Does this calculation you described with the library operate per ORF or per Transcript? -What does 'count of ORF > 10' signify? Does it imply that there should be more than 10 ORFs in total within the transcript, or does it refer to some relative value of the ORF being greater than 10? -Concerning the part where you mention "for sample i," I am currently working with a subset of BAM files. Could you provide guidance on interpreting this correctly: "x_i = 1 if there are more than 10 ORFs in total in..."?

y_i -Regarding y_i, how does the translation status of FLOSS, RRS, and ORFscore align with the original formula where ribo_counts / (transcript length * RPKM transcript) is used? I mean how I can manage the 3 to be able to establish after a cut-off value of 90%

I genuinely appreciate your assistance with these queries. I apologize in advance for the multitude of questions, but your guidance has proven immensely valuable in resolving other issues. That's why I wanted to seek your expertise regarding this particular matter.

Thank you once again for your time and help.

Roleren commented 9 months ago

First a note, never work on subset of bam files. If it is too big, then collapse and convert to ofst format. Please see function ORFik::convert_bam_to_ofst. The collapsed ofst you will be able to load on any computer.

Now for x_i, it is the overlapping reads per orf. Just like DESeq uses 10 reads as minimum for expression analysis by default. So just use countOverlapsW(orfs, ribo, "score") or the counts column from computeFeatures function

For y_i, this is not what I mean, ignore what they state there, it is what the x_i does. What I say is that you create a score from the 3, run it for all CDS's and see what the minimum score is for any CDS with fpkm >1 or something. In code you get the CDS scores as:

``CDS_filt <- CDS[fpkm(CDS, ribo) > 1]

res <- y_i_function(CDS_filt, ribo)

quantile(res, 0.1) # this is your cutoff value for orfs``

These scores are not magic or set in stone. Play around with them and see what makes sense, make your own :)

If you have any questions about implementation let me know

gpattarone commented 9 months ago

Thank you so much for your quick response. I was working guided by your comment and would like to add just a few questions related to the "translation status score" calculated with FLOSS, RRS, and ORFscore functions.

In our case we work with ORFs derived from Ribo-seq, and usually we don't have RNA-seq data and we don't expect to manage it also in future samples, so we would like to understand and adapt better your statement "run it for all CDS's and see what the minimum score is for any CDS with fpkm >1 or something" for our datasets: how we can implement it for Ribo-seq data? and how you would recommend us to set the cutoff value for orfs in our case?

We are sorry to bother you again and thank you so much for your time and help once more time

Roleren commented 9 months ago

Yes, I already sent a solution for that. See code above, the fpkm function exists in ORFik. Just supply the CDSs and riboseq data (as GRanges object). And filter on fpkm output > 1. Let me know how it works.

gpattarone commented 9 months ago

Great, thanks! now I following you better for fpkm usage and did it. What I am missing in 'how to' is the: "y_i_function" you wrote.

Could you please explain me what that function mean and how to implement it?

Thank you again!

Roleren commented 9 months ago

To get y_i column do this:

#Run computeFeatures on both CDS and ORFs:
cds_features <- computeFeatures(cds, RFP, RNA = NULL, Gtf = loadTxdb(df@txdb), uorfFeatures = FALSE)
orf_features <- computeFeatures(orfs, RFP, RNA = NULL, Gtf = loadTxdb(df@txdb), uorfFeatures = FALSE)
# Save for later
saveRDS(cds_features, "project_dir/cds_features.rds")
saveRDS(orf_features, "project_dir/orf_features.rds")

cds_fpkm_filter <- quantile(cds_features$fpkm, 0.1) # Filter for accepted values
y_i_cds <- cds_features[fpkm >= cds_fpkm_filter & ORFScores > 0, .(RRS, ORFScores, floss)] # All frame 0 biased cds over the 10% fpkm quantile
res <-  orf_features[countRFP > 10 & fpkm >= y_i_cds$fpkm &  (RRS > y_i_cds$RRS) & (ORFScores >= y_i_cds$ORFScores) & (floss > y_i_cds$floss),]  # Filter on 10 counts, with fpkm > 10% quantile of cds, ...

Try this and let me know how it works.

gpattarone commented 9 months ago

Hey, thank you for your time and explanations. It was super clear and super helpful your insight on this calculation. I would like to mention that we don't have the CDS (Coding DNA Sequence) data and we don't expect to manage it for future samples. So, would like to ask you, if you know how we can get the same y_i information without use CDS data?

Thank you! and apologies for bother you again

Roleren commented 9 months ago

Well then you have to set manual cut-offs, my initial idea would be something like this:

res <- orf_features[countRFP > 10 & fpkm >= 1 & (RRS > quantile(RRS, 0.5, na.rm = TRUE)) & (ORFScores >= 6.3) & (floss > 0.25),] # ORFscore = 6.3 is the chi square distribution significance cut off point at p-value 0.05, floss 0.25 is a good pick for me usually.

Play around and see what you lose, by using this cut-off. Check out which are false in IGV or using our custom ribo-seq browser Ribocrypt (check out ribocrypt.org to see how it looks)

To get tracks into IGV or RiboCrypt convert to bigwig first, like this:

df <- ORFik::read.experiment("your_ribo_seq_exp")
convert_to_bigWig(df) # Now use the output files here in IGV or ribocrypt, to inspect what your miss using the filter
Roleren commented 9 months ago

Did it work ? :)

gpattarone commented 9 months ago

Yes, it works! Thank you so much, we have in our dataset different chi-square values but the approach methodology was helpful for explore the cut-off and seeing what we lose or choose what prioritize. Thank again!