Roleren / ORFik

MIT License
32 stars 9 forks source link

Problem with windowPerReadLength function #140

Closed JC-therea closed 1 year ago

JC-therea commented 1 year ago

Hello,

Very nice tool! However, I am facing some problems with custom annotations...

The function where I have problems is windowPerReadLength(). After following the pipeline I had an unexpexted result:

hitMap <- windowPerReadLength(grl = cds, tx = tx, reads = footprintsGR, pShifted = FALSE)
hitMap

position     score fraction
  1:        4  8.681545       30
  2:        8 12.641592       30
  3:       12 19.839488       30
  4:       16 16.555389       30
  5:       20 27.955665       30
 ---                            
180:       57  1.231521       31
181:       94  1.197693       31
182:       29  5.095845       31
183:       86  2.621593       31
184:       61  1.210251       31

As you can see it doesn't start in a position before the CDS and also the fraction is not a real fraction...

This is how it looks the inputs:

> cds
GRangesList object of length 5957:
$`gene-YHL048W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames    ranges strand |    cds_id    cds_name exon_rank
           <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 6401-7546      + |         1        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

$`gene-YHL044W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames      ranges strand |    cds_id    cds_name exon_rank
           <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 13565-14272      + |         2        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

$`gene-YHL043W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames      ranges strand |    cds_id    cds_name exon_rank
           <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 14901-15413      + |         3        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

...
<5954 more elements>

####################################################################
####################################################################
####################################################################
> tx
GRangesList object of length 5957:
$`gene-YHL048W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames    ranges strand |   exon_id   exon_name exon_rank
           <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 6353-8108      + |         1        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

$`gene-YHL044W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames      ranges strand |   exon_id   exon_name exon_rank
           <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 13549-14511      + |         2        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

$`gene-YHL043W-T1`
GRanges object with 1 range and 3 metadata columns:
        seqnames      ranges strand |   exon_id   exon_name exon_rank
           <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
  [1] BK006934.2 14901-15413      + |         3        <NA>         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

...
<5954 more elements>

####################################################################
####################################################################
####################################################################

> footprintsGR
GRanges object with 93228528 ranges and 1 metadata column:
               seqnames    ranges strand |      size
                  <Rle> <IRanges>  <Rle> | <integer>
         [1] BK006935.2       620      + |        30
         [2] BK006935.2       724      + |        30
         [3] BK006935.2      1940      - |        30
         [4] BK006935.2      2039      - |        31
         [5] BK006935.2      2121      - |        30
         ...        ...       ...    ... .       ...
  [93228524] BK006949.2    943255      + |        31
  [93228525] BK006949.2    943256      + |        30
  [93228526] BK006949.2    943256      + |        30
  [93228527] BK006949.2    943256      + |        31
  [93228528] BK006949.2    943257      + |        30
  -------
  seqinfo: 16 sequences from an unspecified genome

The annotation is a custom gtf that has some 5'UTR annotated but not for all the transcripts (around 3800 transcripts)

Any suggestions on how to resolve this issue would be fantastic. Thank you!

Roleren commented 1 year ago

Hi, some things here: You need to decide how to handle the non UTR genes.

There are 3 ways:

  1. Remove genes with too short UTRs: tx_to_use <- filterTranscripts(txdb) And then do cds <- cds[tx_to_use] # etc
  2. You create pseudo UTRs, so that all UTRs are always valid (but you will lose 0 length information for later): makeTxdbFromGenome() new_txdb <- loadTxdb()

  3. Enforce minimum size, but keep all (go outside to genomic flank if needed):
    
    windows <- startRegion(grl, tx, TRUE, upstream, downstream)
    lengths <- widthPerGroup(windows, FALSE)

Try this function to make a range extender, we use it in ribocrypt (It will go into ORFik soon)

extend_needed <- function(windows, length, wanted_length, direction = "up") { dif <- length - wanted_length big_enough <- dif >= 0 if (!all(big_enough)) { if (direction == "up") { windows[!big_enough] <- extendLeaders(windows[!big_enough], extension = -dif[!big_enough]) } else { windows[!big_enough] <- extendTrailers(windows[!big_enough], extension = -dif[!big_enough]) }

} return(windows) } new_windows <- extend_needed(...)

Validate all windows are now equal size

new_lengths <- widthPerGroup(new_windows, FALSE) stopifnot(length(unique(new_lengths)) == 1) # All windows must be same size!

If just a few not done, just remove them (you might hit the chromosome boundary).


Then do :
`windowPerReadLength(<other_args>, windows = extended_windows)`
This will now work.

Secondly 'fraction' is multi meaning in ORFik, here it means 'readlength' is 30 and 31 etc. 
JC-therea commented 1 year ago

Hi, thank you very much for the options.

Firstly, I tried the first and the third pipelines and both worked great!

In case anyone needs it the code for the third pipeline is the following:

# Define the windows to expand
windows <- startRegion(cds, tx, TRUE, upstream = 0, downstream = 10)
lengths <- widthPerGroup(windows, FALSE)

# Your new function
extend_needed <- function(windows, length, wanted_length, direction = "up") {
  dif <- length - wanted_length
  big_enough <- dif >= 0
  if (!all(big_enough)) {
    if (direction == "up") {
      windows[!big_enough] <- extendLeaders(windows[!big_enough],
                                            extension = -dif[!big_enough])
    } else {
      windows[!big_enough] <- extendTrailers(windows[!big_enough],
                                             extension = -dif[!big_enough])
    }

  }
  return(windows)
}

# Create the new windows
new_windows <- extend_needed(windows, lengths, 30, direction = "up")
# Check windows size
new_lengths <- widthPerGroup(new_windows, FALSE)
stopifnot(length(unique(new_lengths)) == 1) # All windows must be same size!
# If just a few not done, just remove them (you might hit the chromosome boundary).
hitMap <- windowPerReadLength(grl = cds, tx = tx, reads = footprintsGR, pShifted = FALSE, windows = new_windows)
coverageHeatMap(hitMap, scoring = "transcriptNormalized")

The code produces the following figure which is not precise at all but it's ok as far as the df it is (On the df the peak of 30mers is at -12 and in 31mers at -13).

Rplot

Secondly, I apologize because I got the name wrong. As in the figure produced by coverageHeatMap() the legend indicates "transcript normalized" I thought that the score (not the fraction) was normalized for the transcript length.

Roleren commented 1 year ago

Great, will close this issue then, if you have no more questions ? :)

Also check out our new page: RiboCrypt.org, where we host thousands of precomputed Ribo-seq datasets with easy interactive visualizations. Let me know if you have any questions about that too :)

JC-therea commented 1 year ago

Sure!

JC-therea commented 1 year ago

Is there a way to do the same but with orfik experiments? I am planning to do the same but with more samples.