LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License

genomic alignments #31

Closed daniel-spies closed 3 years ago

daniel-spies commented 3 years ago

hi there,

I have reads aligned by STAR (first to the genome, then to the transcriptome), so the seqname read in by the bamtolist method is only the chromosome name, which is then not found in the generated annotation data table.

Your transcript_align parameter only sets the strand; it does not extract transcript names from an annotation for genomic reads.

It would be nice to have these reads mapped on the fly by supplying a GTF file, by using the txdb specified in the annotation creation process, or by passing a pre-built txdb object (in the case of a custom GTF), instead of having to convert to BED, intersect, and then use the bedtolist function.
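To make the request concrete, here is a minimal sketch of the kind of genome-to-transcript coordinate conversion meant above. It is plain Python and not riboWaltz functionality; the function name `genomic_to_transcript` and the exon coordinates are hypothetical, and exons are assumed to be 0-based, half-open genomic intervals as they might be parsed from a GTF.

```python
def genomic_to_transcript(pos, exons, strand="+"):
    """Convert a genomic position to a 0-based transcript coordinate.

    exons: list of (start, end) genomic intervals, 0-based and half-open
           (an assumption of this sketch, not a riboWaltz convention).
    Returns None when the position falls outside every exon
    (for example in an intron or an intergenic region).
    """
    # Walk exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of transcript bases contributed by earlier exons
    for start, end in ordered:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    return None


# Hypothetical two-exon transcript
exons = [(100, 200), (300, 400)]
print(genomic_to_transcript(310, exons, "+"))  # 110
print(genomic_to_transcript(310, exons, "-"))  # 89
print(genomic_to_transcript(250, exons, "+"))  # None (intronic)
```

Reads that return None here are the ones with no counterpart in a transcript-based annotation table.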

best Daniel

fabiolauria commented 3 years ago

Hi Daniel. Sorry, but I'm not sure I got the point here. Here are some comments that might be related to your issue:

  1. riboWaltz only works with read alignments based on transcript coordinates. Most Ribo-seq reads are expected to map to mRNAs rather than to introns or intergenic regions;
  2. as a consequence, BAM files based on genomic coordinates (and the related "genomic reads") cannot be used; only BAM files based on transcript coordinates are suitable for riboWaltz. Thus, comparing genomic and transcript alignments is not possible;
  3. BAM files based on transcript coordinates can be generated by: i) aligning directly against transcript sequences (using STAR, Bowtie, etc.). In this case, transcript_align should be set to TRUE (the default); ii) aligning against standard chromosome sequences and requiring the output to be translated into transcript coordinates (using STAR with its option --quantMode TranscriptomeSAM). In this case, transcript_align should be set to FALSE;
  4. in both cases, the BAM files suitable for riboWaltz analyses include read seqnames containing transcript names, in agreement with the transcript names listed in the annotation data table. It may be necessary to "cut" the read seqnames to remove additional information following the name of the transcript (see refseq_sep in bamtolist);
  5. the annotation data.table must be built from the GTF file downloaded from the very same repository that provides the FASTA file used for the alignment (to ensure that the transcript names match).
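Points 4 and 5 can be checked before running an analysis. The following is a minimal, language-agnostic sketch in plain Python (not the riboWaltz R API): it trims everything after a separator from each reference name, mimicking the "cut" described in point 4, and reports names that are absent from the annotation. The function names, the "|" separator, and the example identifiers are illustrative assumptions.

```python
def trim_seqname(seqname, sep=None):
    """Keep only the part of a reference name before the first separator."""
    if sep is None:
        return seqname
    return seqname.split(sep, 1)[0]


def unmatched_seqnames(bam_seqnames, annotation_transcripts, sep=None):
    """Return trimmed reference names that do not appear in the annotation."""
    known = set(annotation_transcripts)
    trimmed = {trim_seqname(name, sep) for name in bam_seqnames}
    return sorted(trimmed - known)


# Hypothetical reference names as they might appear in a BAM header
bam_seqnames = [
    "ENST00000000001.1|GENE1|protein_coding",
    "ENST00000000002.3|GENE2|lncRNA",
    "chr1",  # a genomic reference name, which cannot match any transcript
]
annotation = ["ENST00000000001.1", "ENST00000000002.3"]

print(unmatched_seqnames(bam_seqnames, annotation, sep="|"))  # ['chr1']
```

An empty result suggests that the BAM and the annotation were built from matching sources; chromosome names in the output indicate a genome-coordinate BAM.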

I hope you can find something useful for your purposes. If not, please let me know.

Best Fabio

daniel-spies commented 3 years ago

Dear Fabio,

Thanks for the fast reply.

We're interested in finding uORFs and novel ORFs that are not annotated, so we're not mapping only against the transcriptome, and we're also considering non-canonical start codons. But then riboWaltz is apparently the wrong tool, unfortunately, if it only considers annotated regions and does not define novel ORFs as well.

Then I will use other tools, such as Rp-Bp, to extract all possible ORFs, filter for those with reads, and then create the periodicity plots myself.
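As a rough illustration of the last step, here is a minimal Python sketch of the counting behind a periodicity plot: the fraction of P-site positions falling in each reading frame of a candidate ORF. It assumes that P-site positions have already been assigned and that coordinates are 0-based and half-open; the function name `frame_fractions` and the example numbers are hypothetical.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start, orf_end):
    """Fraction of P-sites in frames 0, 1 and 2 relative to the ORF start.

    Positions outside [orf_start, orf_end) are ignored.
    """
    counts = Counter(
        (pos - orf_start) % 3
        for pos in psite_positions
        if orf_start <= pos < orf_end
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Hypothetical P-site positions; 50 lies outside the ORF and is skipped
psites = [100, 103, 106, 107, 109, 50]
print(frame_fractions(psites, orf_start=100, orf_end=130))  # [0.8, 0.2, 0.0]
```

A strong enrichment in one frame is the usual sign of translation; the three fractions can be drawn as a bar plot per ORF.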

Thanks Daniel

fabiolauria commented 3 years ago

Hi Daniel. As you said, riboWaltz was developed to investigate ribosome localization along annotated sequences rather than to identify and analyze novel uORFs. You should be able to reach your goals by proceeding as you proposed. Nevertheless, if I can be of any help in the future, just let me know.

Best Fabio