P-site adjustment and ORFscore

ashakru commented 4 years ago

Thank you for creating such a wonderful tool. It already has helped me a lot! I try to use ORFik to score novel ORF identified from Ribo-Seq data and as reference I used computeFeatures function on annotated CDS regions. I was surprised to see that most of annotated CDS had negative ORF score. I modified Ribo-Seq alignment so that each read represents now estimated P-site location. Information about read length was moved to score (similar approach was shown in ORFik vignette, if I'm not mistaken):

RIBOBAM <- "myRiboSeq.bam"
riboBam <- readGAlignments(RIBOBAM) 
scores <- qwidth(riboBam)      
riboBam <-  GenomicAlignments::qnarrow(riboBam, start = 12, width = 1)
riboBamGR <- GRanges(riboBam)
score(riboBamGR) <- scores

Then riboBamGR was used as input for computeFeatures.

Now, ORFscore has increased and I have 20 808 CDSs with positive ORFscore vs. 17 037 CDSs before P-site adjustment. However FLOSS score is now 0 for all of the CDSs. Do you think that P-site adjustment is essential for ORFik to compute a valid ORFscore? If so, what is in your opinion a good way of doing that to compute FLOSS score as well in one run?

Roleren commented 4 years ago

Hm, computeFeatures is made to be able to get both in the same run, so let me check that nothing strange is going on, else I need you to give me some more info on your data structure. Will get back to you in a few hours :)

Roleren commented 4 years ago

First off: if you get this warning, then I know the answer why it is happening:

Warning:
All widths are 1, If ribo-seq is p-shifted, score or size meta column should contain widths of read, will continue using 1-widths

If not, here are some thoughts: Well, I checked through the function and there is no bugs it seems. So can you show me how the riboBamGR object looks like ?

There are also 2 problems here:

you use qwidth(riboBam), this is not totally safe, I made a function in ORFik that takes care of soft / hard clipping issues, so instead of qwidth() use readWidths(), double check that it gives the same answer.
Why do you do custom p-shifting, you did not use detectRibosomeShifts() to find p-shifts ? If you have the shifts, the function shiftFootprints() does this for you. shiftFootprints() does all your code in one line, so why not use it ?

And why are you shifting all your reads by 12 ? Do you only have one read length ?

Let me know what you think.

ashakru commented 4 years ago

I can't see such warning. Addressing your comments:

riboBamGR is a simple GRanges object with read original width encoded in score:

GRanges object with 3795776 ranges and 1 metadata column:
          seqnames    ranges strand |     score
             <Rle> <IRanges>  <Rle> | <integer>
    [1]       chr1    629923      + |        29
    [2]       chr1    629924      + |        30
    [3]       chr1    629924      + |        29
    [4]       chr1    629937      + |        30
    [5]       chr1    629937      + |        29

qwidth() use readWidths() gave exactly the same answer
ORFik scoring is a part of larger pipeline. I predicted p-site offsets with plastid and after QCs I used them as an input for ORF identification tools. I planned to use the same offsets for scoring just to be consistent.
12 was just an example, because for testing I've selected only reads between 27-30 nt.

Update: I've managed to run computeFeatures, using the following code. Basically, I've replaced offsets computed by ORFik with those I've got from plastid and it worked like a dream.

# Load files
rnabam <- readBam("riboseq.bam")
ribobam <- readBam("rnaseq.bam")
# Define offsets
 plastidOut <- read_delim("plastid.out", delim = "\t", comment = "#") %>%
    filter(length %in% 25:31)
offsets <- data.frame(fraction = plastidOut$length,
                      offsets_start = -1*plastidOut$p_offset, 
                      stringsAsFactors = F)
# Input for computeFeatures
 psites <- shiftFootprints(ribobam, offsets)

Now the input looks like this and FLOSS score is not 0 anymore.

GRanges object with 1110784 ranges and 1 metadata column:
              seqnames    ranges strand |      size
                 <Rle> <IRanges>  <Rle> | <integer>
        [1]       chr1    138760      + |        29
        [2]       chr1    138760      + |        29
        [3]       chr1    629924      + |        31
        [4]       chr1    629926      + |        26
        [5]       chr1    629937      + |        30

It is interesting, because according to the manual read length information in score should have worked as well.

Roleren commented 4 years ago

Ok, thanks, I will test it today, looks like something is going on with how floss pick read length column.

Roleren commented 4 years ago

BTW, which ORFik and R version are you running ? Just send me: sessionInfo()

Roleren commented 4 years ago

I see the documentation is a bit unclear, will update it by tomorrow for devel version.

Roleren commented 4 years ago

So conclusion, I have update documentation to make this function more clear. It should now work as you wanted in devel branch, but I recommend using a $size column instead of $score.

I will close the issue in 2 days if you don't have any more questions.

Roleren / ORFik

P-site adjustment and ORFscore #94