LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License

riboWaltz run psite() function meet error #77

Closed Jiahuan90 closed 8 months ago

Jiahuan90 commented 8 months ago

Hi Fabio,

When I ran the psite() function, I got this error: `Error in if ((extremity == "auto" & ((best_from3_tab[, perc] > best_from5_tab[, : argument is of length zero. Calls: psite`

I realize that someone asked this question before, but I don't know how it was resolved in the end. Could you help me check what is going wrong? I use the GRCm38 mouse genome and the matching Ensembl GTF file. I have uploaded the bamlist of my data and the GTF file.

Thank you for your kind help!

Test.zip

fabiolauria commented 8 months ago

Hi there, thank you for using riboWaltz.

The error you reported is often thrown when, for all or specific read lengths, the tool is not able to find enough reads around the reference codon (the start, usually). If this is the case, downstream steps, such as computing the maximum number of read extremities around the reference codon to set the temporary offset, cannot be performed.

With this in mind, and assuming you didn't subset your dataset, I went through your data. As a first observation, I immediately noted that you have a very low number of total reads (24634) and, of these, only 50 map around the start codon of annotated transcripts. Since riboWaltz stratifies the reads by length and there are 17 different lengths, that leaves on average 3-4 reads per length, which is not enough to proceed.

Now, if you did subset your dataset, I would guess that there is at least one read length associated with very few reads. In fact, I noticed that you didn't filter any read length out. I would suggest keeping only reads longer than ~20 nt and shorter than ~50 nt (unless you are interested in disomes). This range is more than enough to include bona fide ribosome-protected fragments. Alternatively, you can check the distribution of read lengths, keep only those falling in the range covering 90-95% of the reads, and exclude those in the tails of the distribution.
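The "keep the central 90-95% of read lengths" idea can be sketched as follows. This is a toy example on made-up (length, count) pairs, not the author's procedure; on real data the counts could come from something like `samtools view sample.bam | awk '{print length($10)}' | sort -n | uniq -c` (with the columns reordered to length-then-count).

```shell
# Toy sketch: find the read-length range covering the central 90% of reads.
# Input lines are "length count" pairs, sorted by length (synthetic values).
printf '18 5\n28 300\n29 500\n30 450\n31 200\n45 10\n' |
awk '{ len[NR] = $1; cnt[NR] = $2; tot += $2 }
     END {
       lo = 0.05 * tot; hi = 0.95 * tot; run = 0
       for (i = 1; i <= NR; i++) {
         run += cnt[i]
         if (run >= lo && min == "") min = len[i]   # first length past the 5% mark
         if (run >= hi && max == "") max = len[i]   # first length past the 95% mark
       }
       print min, max                               # suggested length range to keep
     }'
# with these toy counts this prints: 28 31
```

Reads outside the printed range would then be excluded before running psite().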

Let me know if this makes sense to you.

Best, Fabio

Jiahuan90 commented 8 months ago


Thank you very much for your response and the suggestions you provided. I'll try implementing your suggestion and will update you on the situation later. Your help has been sincerely appreciated!

Jiahuan90 commented 8 months ago


@fabiolauria Hi Fabio,

The data I uploaded isn't a subset; it represents the entire set. However, this data comes from Ribo-seq performed on a small number of cells (about 150 cells). Despite the BAM file being large, at 2.5GB, there are only 24,634 usable reads. I also noticed that someone else's data, with a BAM file of only 24M mapped to the transcriptome, resulted in 1,103,531 reads after bamtolist, which is 40 times the number of reads I have. Could this discrepancy be caused by having too many duplicate reads in my data? Additionally, my original sample had a high sequencing volume, reaching 30G.

fabiolauria commented 8 months ago

Hi there

> The data I uploaded isn't a subset; it represents the entire set. However, this data comes from ribo-seq performed on a small number of cells (about 150 cells). Despite the bam file being large, at 2.5GB, there are only 24,634 usable reads.

Now I understand. It makes a lot of sense that your sample is small given the number of cells. Unfortunately I cannot be of much help here, since I have no experience with analyses of such low-input data. I know for sure that 500K reads of the proper length, even a bit fewer, should be enough for a good analysis.

> I also noticed that someone else's data, with a bam file of only 24M mapped to the transcriptome, resulted in 1,103,531 reads after bam2list, which is 40 times the number of reads I have. Could this discrepancy be caused by having too many duplicate reads in my data?

For sure duplicates can drastically decrease the final number of reads. You should check the data before and after deduplication to try to understand what is going on and where reads are vanishing.
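A minimal sketch of such a check, on made-up alignments rather than a real BAM. The duplicate definition here (same reference, position, and sequence) is a simplifying assumption for illustration; in practice one would compare read counts before and after deduplication, e.g. with `samtools flagstat before.bam` vs `samtools flagstat after.bam`.

```shell
# Toy duplicate check on synthetic alignments: "read_id chrom pos seq".
# Assumption (hypothetical): reads sharing chrom, pos and sequence are
# duplicates of one underlying molecule.
printf 'r1 chr1 100 AAA\nr2 chr1 100 AAA\nr3 chr1 105 CCC\nr4 chr1 100 AAA\n' |
awk '{ seen[$2 " " $3 " " $4]++ }       # count reads per (chrom, pos, seq) key
     END {
       for (k in seen) { total += seen[k]; uniq++ }
       printf "total=%d unique=%d duplicates=%d\n", total, uniq, total - uniq
     }'
# with these toy reads this prints: total=4 unique=2 duplicates=2
```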

I am sorry riboWaltz cannot support you in your analysis, but unfortunately there is nothing I can do right now.

If you have additional questions about riboWaltz or other errors you faced in this context I'll leave the issue open for a couple of days. Otherwise, feel free to close it yourself.

Best, Fabio

Jiahuan90 commented 8 months ago


@fabiolauria Hi there,

I wanted to express my deepest gratitude for your insightful responses. Your two responses have clarified for me that the core of riboWaltz analysis hinges on the number of reads. Following your guidance, I adjusted the STAR mapping parameters, which significantly increased the number of reads mapping to the transcriptome, now totaling 6 million reads. This improvement allows me to proceed further with riboWaltz analysis.

As I delve deeper into the analysis, two new questions have emerged:

  1. Regarding the analysis of ribosome pausing, there seems to be no definitive conclusion on whether to remove duplicate reads. I noted your statement that "Duplicates may be either representative of meaningful biological information on ribosome stalling or generated by PCR amplification during library preparation." Could you elaborate on how to approach this aspect?

  2. I've noticed in many of your answers the recommendation to use an appropriate read length for downstream analysis, such as 20-50 nt reads. Could you please explain the rationale behind this specific range?

Thank you once again for your invaluable help. Your expertise not only aids in advancing my project but also deepens my understanding of the intricacies involved in riboWaltz analysis.

Looking forward to your advice.

Best, Jiahuan

fabiolauria commented 8 months ago

Hi Jiahuan, I confirm that riboWaltz infers the P-site offset according to the population of reads mapping around the reference (start) codon. Thus, as you said, if the number of reads is too low, the P-site offset cannot be uniquely identified - or cannot be identified at all - and riboWaltz crashes.

> Regarding the analysis of ribosome pausing, there seems to be no definitive conclusion on whether to remove duplicate reads. I noted your statement that "Duplicates may be either representative of meaningful biological information on ribosome stalling or generated by PCR amplification during library preparation." Could you elaborate on how to approach this aspect?

Unfortunately there is no "correct" way to proceed here. It's a matter of how the bioinformatician wants to treat the data, according to their experience and the hints they get from the literature. If you keep all the reads mapping to the very same position, you likely keep many reads which are there because the PCR "generated" them and not because multiple stalled ribosomes were actually associated with the sequenced fragments. On the other hand, if you remove all the duplicates, you may lose reads coming from ribosomes stalling at specific positions. As I said, the "truth" doesn't exist.

> I've noticed in many of your answers the recommendation to use an appropriate read length for downstream analysis, such as 20-50nt reads. Could you please explain the rationale behind this specific range?

Eukaryotic ribosomes cover ~30 nt. In specific conformations they cover more or fewer nucleotides, with their footprint ranging, let's say, from 20 to 40 nt (50 nt is already a bit more than expected). Given this, looking at the distribution of the read lengths you should see a peak around 30 nt and very few reads <20 nt and >40 nt. In my experience, this is usually the case. If not, something might have gone wrong at some point.
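The "peak around 30 nt" sanity check can be sketched in one line. The counts below are synthetic, stand-in values, not from any real library; on real data the (length, count) table could again be derived from the BAM with samtools and awk.

```shell
# Quick sanity check on a synthetic length distribution ("length count" pairs):
# the modal read length of a good Ribo-seq library is expected to sit near 30 nt.
printf '25 40\n28 900\n29 2500\n30 3100\n31 1200\n34 60\n' |
awk '$2 > best { best = $2; mode = $1 }   # track the length with the highest count
     END { print mode }'
# with these toy counts this prints: 30
```

If the printed mode lands far from 30 nt, the library preparation or trimming step deserves a closer look.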

This said, I'm happy to see you managed to run riboWaltz and solve the error from the beginning of the discussion. Thus, I'll close this issue so we don't go off topic. If you have additional technical problems or get stuck at another riboWaltz function, feel free to open new issues, one for each function-specific question/error, so that other users can easily find them if needed.

Best, Fabio