LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
43 stars 11 forks source link

bamtolist doesn't work on genomically aligned bams. #12

Closed zslastman closed 5 years ago

zslastman commented 5 years ago

In the below code, bamtolist checks whether the reads have a transcript matching the annotation:

    nreads <- nrow(dt)
    cat(sprintf("reads: %s M\n", format(round((nreads / 1000000), 2), nsmall = 2)))
    dt <- dt[as.character(transcript) %in% as.character(annotation$transcript)]
    if(nreads != nrow(dt)){
      if(nrow(dt) == 0){
        stop("%s M  (%s %%) reads removed: referece transcript ID not found in the annotation table\n\n")

But it does this before checking if the 'transcript_aligned' options is true:

    if(transcript_align == TRUE | transcript_align == T){
      dt <- dt[strand == "+"]
      cat(sprintf("%s M  (%s %%) reads removed: mapping on the negative strand\n", 
                  format(round((nreads - nrow(dt)) / 1000000, 2), nsmall = 2), 
                  format(round(((nreads - nrow(dt)) / nreads) * 100, 2), nsmall = 2) ))
    }

Thus the bamtolist function doesn't work on genomically aligned bams, which won't have any transcript field.

fabiolauria commented 5 years ago

Hi. The first bunch of code you posted was implemented to warn the users about discrepancies in transcript names in BAM and annotation files. The second one is to notify how many reads map on the negative strand and are then discarded. This should be done only if reads are aligned on transcript sequences and all reads should align on positive strand. These two steps are not connected because riboWaltz is not designed for BAM files based on genomic coordinates. This choice is due to the main aim of the RiboSeq assay that is supposed to generate RNA fragments from mRNAs but not from introns or intergenic regions.

There might have been a misunderstanding about what riboWaltz can do and which are the inputs it uses. Up to now, riboWaltz only works with BAM from transcriptome alignments. BAM from transcriptome alignments can be generated in two ways:

All the references in the documentation of riboWaltz to genomic sequences refer to the latter case, where you do have a BAM file form a transcriptome alignemnt but the FASTA file contains the chromosome sequences instead of the transcript ones. I am very sorry this might have been misleading.

I'm putting an effort into updating the documentation to make this point as clear as possible.

Bests, Fabio