Roleren / ORFik

MIT License
32 stars 9 forks source link

Error with "detectRibosomeShifts()" #177

Closed JC-therea closed 6 months ago

JC-therea commented 6 months ago

Hi,

First of all, thank you for your tool. I developed an internal pipeline that uses ORFik to estimate the offset of ribosome profiling samples and normally it works perfectly.

However, I'm facing a problem with my non-canonical organism. Normally what I do is the following:

suppressMessages(library(ORFik))
suppressMessages(library(dplyr))

args <- commandArgs(trailingOnly = TRUE)
gtf <- args[1]
bam_file <- args[2]

footprints <- readBam(bam_file)
txdb <- suppressMessages(GenomicFeatures::makeTxDbFromGFF(gtf, chrominfo = seqinfo(footprints)))
cds <- loadRegion(txdb, part = "cds")
shifts <- suppressWarnings(detectRibosomeShifts(footprints, txdb, verbose = F , accepted.lengths = 25:35))

And normally works fine if I have UTRs in my annotation. However when I don't have UTRs as you probably know I get the following error:

Error in filterTranscripts(txdb, minFiveUTR, minCDS, minThreeUTR) : 
  No transcript has leaders and trailers of specified minFiveUTR minCDS, minThreeUTR, create pseudo-UTRs, see annotation vignette!

Which is normal I get that I have to add pseudo-UTRs to make it work and I tried with some command that I saw in your tool:

txdb <- assignTSSByCage(gtf, cage = NULL, pseudoLength = 50) However, that makes me an additional error which is the following:

Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) : 
  supplied seqname style "RefSeq" is not supported

Checking the seqlevels from the bam file and the txdb I obtained the same chromosomes:

> seqlevels(txdb)
 [1] "NC_018044.1" "NC_047487.1" "NC_047488.1" "NC_047489.1" "NC_047490.1"
 [6] "NC_047491.1" "NC_047492.1" "NC_047493.1" "NC_047494.1" "NC_047495.1"
[11] "NC_047496.1" "NC_047497.1" "NC_047498.1" "NC_047499.1" "NC_047500.1"
[16] "NC_047501.1" "NC_047502.1"
> seqlevels(footprints)
 [1] "NC_047487.1" "NC_047488.1" "NC_047489.1" "NC_047490.1" "NC_047491.1"
 [6] "NC_047492.1" "NC_047493.1" "NC_047494.1" "NC_047495.1" "NC_047496.1"
[11] "NC_047497.1" "NC_047498.1" "NC_047499.1" "NC_047500.1" "NC_047501.1"
[16] "NC_047502.1" "NC_018044.1"

So Im a little bit lost and I don't know if its better to artificially create a gtf with "artificial" UTRs or if I should use another command.

Thanks in advance for your help

Carlos

Roleren commented 6 months ago

I Made this function that is safer, it is supposed to catch this error:

ORFik::makeTxdbFromGenome(gtf = ..., genome = ..., organism = ..., optimize = TRUE, pseudo_5UTRS_if_needed = 50) # 50nt pseudo UTR, do this once only!
txdb <- loadTxdb(paste0(gtf, ".db")) # Use this to load

Note this is also possible to create on download of new species with: getGenomeAndAnnotation(..., optimize = TRUE, pseudo_5UTRS_if_needed = 50)

Let me know if it worked

JC-therea commented 6 months ago

Hello again,

I m having the same error:

> ORFik::makeTxdbFromGenome(gtf = gtf, genome = genome, organism = "Saccharomyces paradoxus", optimize = TRUE, pseudo_5UTRS_if_needed = 50) # 50nt pseudo UTR, do this once only!
Making txdb of GTF
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
---- Adding Pseudo 5' UTRs
- Total leaders: 0
- Leader length statistics:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

- Percentage of CDS' with leaders: 0%
- Adding pseudo
--------------------------
Txdb stored at: Outputs/QC/Annotation/Saccharomyces_paradoxus_CBS432.gtf.db
--------------------------
Optimizing annotation, saving to: Outputs/QC/Annotation/ORFik_optimized
Creating fst speedup file for transcript lengths, at location:
Outputs/QC/Annotation/ORFik_optimized/Saccharomyces_paradoxus_CBS432_2024-05-23175255+0200_txLengths.fst
fstcore package v0.9.14
(OpenMP detected, using 20 threads)
Creating rds speedup files for transcript regions
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> txdb <- loadTxdb(paste0(gtf, ".db")) # Use this to load
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi

Attaching package: ‘AnnotationDbi’

The following object is masked from ‘package:dplyr’:

    select

> 
> ##############
> #txdb <- assignTSSByCage(gtf, cage = NULL, pseudoLength = 50)
> shifts <- suppressWarnings(detectRibosomeShifts(footprints, txdb, verbose = F , accepted.lengths = 25:35))
Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) : 
  supplied seqname style "RefSeq" is not supported

In addition, I would like to ask you regarding the arguments of the function... How relevant is the argument organism for making work the function? I ask because I would like to do something generalistic without the need of including each time the organism.

If you want to check the files are here

Roleren commented 6 months ago

Hm, will check it out, Organism is only required for gene symbol fetching, so if you do not need that then skip it :)

Roleren commented 6 months ago

Downloaded your data, and found the error.

I have pushed a fix now, give it a go :)

JC-therea commented 6 months ago

Nice commit hahaha. Now it works like a charm.

Thank you very much!