Closed NastiaSkuba closed 6 months ago
Very impressive work here. I think you have a malformed txdb. Can you run the ORFik::makeTxdbFromGenome function with optimize = TRUE, then use that txdb instead? That should work. Let me know if it does not.
Thank you for your fast response!
I tried the original .gff3 file and .gtf files obtained with AGAT and gffread, with the parameter optimize=TRUE.
Nothing worked: seqinfo(loadTxdb(df_Ribo)) always gives the same result, and the BiocParallel error is also the same.
I realized I had already run into problems with this genome (see reference). The maize genome has empty lines. Is it possible to work with such a genome in ORFik?
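To check whether a genome file really contains empty lines, a minimal base R sketch could look like the one below. The function names `count_blank_lines` and `drop_blank_lines` are hypothetical helpers, not part of ORFik, and the thread does not establish that empty lines are what causes the undefined seqlengths.

```r
# Count blank (empty or whitespace-only) lines in a FASTA file.
# The file is read in chunks so a large genome need not fit in memory.
count_blank_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n_blank <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    n_blank <- n_blank + sum(!nzchar(trimws(lines)))
  }
  n_blank
}

# Write a copy of the file with blank lines removed (hypothetical helper).
drop_blank_lines <- function(path, out_path, chunk_size = 100000L) {
  con_in <- file(path, open = "r")
  con_out <- file(out_path, open = "w")
  on.exit({
    close(con_in)
    close(con_out)
  })
  repeat {
    lines <- readLines(con_in, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    writeLines(lines[nzchar(trimws(lines))], con_out)
  }
  invisible(out_path)
}
```

If a cleaned copy is written, any existing FASTA index for the old file no longer matches its byte layout, so the index would need to be rebuilt before the txdb is made again.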
Could you describe how you ran the txdb step above for the gtf and the gff3, and print the seqinfo from each of the txdbs made? I'm quite sure that is where the error is: the txdb has undefined seqlengths.
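A small sketch of that check, for anyone following along: it lists the sequences whose length is NA. `missing_seqlengths` is a hypothetical helper, and the .db paths in the commented usage are placeholders; `loadTxdb`, `seqinfo` and `seqlengths` are the same calls used elsewhere in this thread.

```r
# Return the names of sequences whose length is undefined (NA).
# `lens` is a named vector, such as the one returned by seqlengths().
missing_seqlengths <- function(lens) {
  names(lens)[is.na(lens)]
}

# Usage with ORFik (paths are hypothetical placeholders):
# library(ORFik)
# for (path in c("annotation.gtf.db", "annotation.gff3.db")) {
#   lens <- seqlengths(seqinfo(loadTxdb(path)))
#   message(path, ": ", length(missing_seqlengths(lens)), " of ",
#           length(lens), " sequences have no length")
# }
```

An empty result means every sequence has a defined length, which is the same condition that anyNA() returning FALSE expresses.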
Today I repeated everything with files from http://oct2017-plants.ensembl.org/info/website/ftp/index.html
But this code is representative for any file I tried and the output is the same.
organism <- "Zea mays"
genome <- "../genome/Zea_mays_Ensembl37/Zea_mays.AGPv4.dna.toplevel.fa"
gtf <- "../genome/Zea_mays_Ensembl37/Zea_mays.AGPv4.37.gtf" # path to the annotation file; change as needed
txdb_file <- ORFik::makeTxdbFromGenome(gtf, genome, organism = organism, optimize = TRUE, return = TRUE)
Output:
Making txdb of GTF
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Warning: The "phase" metadata column contains non-NA values for features of type stop_codon. This information
was ignored.OK
--------------------------
Txdb stored at: ../genome/Zea_mays_Ensembl37/Zea_mays.AGPv4.37.gtf.db
--------------------------
Optimizing annotation, saving to: ../genome/Zea_mays_Ensembl37/ORFik_optimized
Creating fst speedup file for transcript lengths, at location:
../genome/Zea_mays_Ensembl37/ORFik_optimized/Zea_mays.AGPv4.37_2024-05-08094034+0200_txLengths.fst
Creating rds speedup files for transcript regions
Results of seqinfo(loadTxdb(txdb_file)):
Seqinfo object with 121 sequences (2 circular) from an unspecified genome; no seqlengths:
NA
seqlengths isCircular genome
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
... ... ... ...
B73V4_ctg92 NA NA NA
B73V4_ctg95 NA NA NA
B73V4_ctg97 NA NA NA
B73V4_ctg98 NA NA NA
Pt NA TRUE NA
With every version of the MaizeGDB files I used before, I get the same seqinfo() result:
Seqinfo object with 108 sequences (2 circular) from an unspecified genome; no seqlengths:
NA
seqlengths isCircular genome
Mt NA TRUE NA
B73V4_ctg10 NA NA NA
B73V4_ctg100 NA NA NA
B73V4_ctg102 NA NA NA
B73V4_ctg103 NA NA NA
... ... ... ...
Chr6 NA NA NA
Chr7 NA NA NA
Chr8 NA NA NA
Chr9 NA NA NA
Pt NA TRUE NA
The only exception is the .gff3 format, where I additionally got the following table:
  seqid   start     end strand ID                      Name                         Parent Parent_type
1  Chr1  572435  572663      + NA Zm00001d022649_T001.exon1 transcript:Zm00001d022649_T001          NA
2  Chr1 1030351 1031090      + NA Zm00001d022650_T001.exon1 transcript:Zm00001d022650_T001          NA
3  Chr1 1540407 1540586      - NA Zm00001d022651_T001.exon1 transcript:Zm00001d022651_T001          NA
4  Chr1 1918444 1918644      - NA Zm00001d022652_T001.exon1 transcript:Zm00001d022652_T001          NA
5  Chr1 2170280 2170492      - NA Zm00001d022653_T001.exon2 transcript:Zm00001d022653_T001          NA
6  Chr1 2170757 2170983      - NA Zm00001d022653_T001.exon1 transcript:Zm00001d022653_T001          NA
I also found a BioStars post with exactly the same issue: https://www.biostars.org/p/9552773/ Was the problem solved in that case?
Another, possibly related, issue: after running the detectRibosomeShifts() function, I immediately received this error:
Error in loadTxdb(txdb) : txdb must be path, list or TxDb
Hm, that is not what I get:
library(ORFik)
> annotation <- getGenomeAndAnnotation("Zea mays", output.dir = file.path(config()["ref"], "zea_mays/"), optimize = TRUE)
Loading premade Genome files, do remake = TRUE if you want to run again
> seqinfo(loadTxdb(paste0(annotation["gtf"], ".db")))
Seqinfo object with 685 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
1 308452471 <NA> <NA>
2 243675191 <NA> <NA>
3 238017767 <NA> <NA>
4 250330460 <NA> <NA>
5 226353449 <NA> <NA>
... ... ... ...
scaf_691 31489 <NA> <NA>
scaf_692 31211 <NA> <NA>
scaf_693 30818 <NA> <NA>
scaf_694 30512 <NA> <NA>
scaf_695 30084 <NA> <NA>
> anyNA(seqlengths(seqinfo(loadTxdb(paste0(annotation["gtf"], ".db")))))
[1] FALSE
Reference assembly I get is: Zea_mays.Zm-B73-REFERENCE-NAM-5.0.56
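One way to compare the two setups is to look at the FASTA index next to the genome file. The sketch below assumes a samtools-style .fai index exists (for example one created with samtools faidx). `read_fai_seqlengths` and `unmatched_seqnames` are hypothetical helpers written in base R, not ORFik functions. That a naming mismatch (such as "Chr1" in the annotation versus "1" in the FASTA) explains the NA seqlengths is only an assumption here.

```r
# Read a samtools-style FASTA index (.fai) into a named vector of lengths.
# Columns: name, length, byte offset, bases per line, bytes per line.
read_fai_seqlengths <- function(fai_path) {
  fai <- read.table(
    fai_path, sep = "\t", header = FALSE, quote = "", comment.char = "",
    stringsAsFactors = FALSE,
    # Names such as "1" must stay character; offsets can exceed integer range.
    colClasses = c("character", "numeric", "numeric", "numeric", "numeric")
  )
  setNames(fai[[2]], fai[[1]])
}

# Sequence names used in the annotation that the FASTA index does not contain.
unmatched_seqnames <- function(annotation_names, fai_lengths) {
  setdiff(annotation_names, names(fai_lengths))
}

# Usage (paths are hypothetical placeholders):
# lens <- read_fai_seqlengths("genome.fa.fai")
# unmatched_seqnames(seqlevels(loadTxdb("annotation.gtf.db")), lens)
```

If the second call returns any names, those sequences have no matching entry in the index, so no length can be looked up for them from the FASTA.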
After updating R and reinstalling all the packages, I got the function working!
I am currently using R 4.3.3 on a Mac M2 laptop. Unfortunately, I could not run STAR there, because ORFik did not detect the fastp installation for this processor. Nevertheless, I can run the alignment on an Ubuntu machine, but there the annotation is still malformed. Thank you for your help!
Hi,
I am working with maize Ribo-seq data.
I got all genome files from MaizeGDB: https://download.maizegdb.org/Zm-B73-REFERENCE-GRAMENE-4.0/
The .gff3 file was converted to .gtf using AGAT. (With the original .gff3 file I eventually got the same results, plus some additional warnings, like:
Warning: gff-version directive indicates version is 3 , not 3C
)
I aligned the reads, trimmed with Trim Galore!, to the genome using ORFik:
Then I created the corresponding experiments and tried to run QCreport() and shiftFootprintsByExperiment().
I get the PDF files STATS_plot, cor_plot, and PCA_plot. The main problem is that afterwards I get a BiocParallel error, for example:
or
I already tried adding BPPARAM = SerialParam() and reinstalling the ORFik package; neither helped.
Command
returned