Roleren / ORFik

MIT License
33 stars 9 forks source link

Error in shiftFootprints #123

Closed mahabubp closed 2 years ago

mahabubp commented 2 years ago

Hi Roleren,

When I run shiftedReads <- shiftFootprints(reads, detectRibosomeShifts(reads, txdb)), I get the following error. Could you please look into it? Thanks

Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand) : Seqlengths of x contains NA values!

Roleren commented 2 years ago

Hm, yeah, think I know what it is, will add some tests to avoid this in the future, will let you know when I push to github

Roleren commented 2 years ago

First, could you do:

devtools::install_github("Roleren/ORFik") then restart R and try again, I think this is a bug on main branch and that I fixed it earlier.

mahabubp commented 2 years ago

Okay. I tried it but it didn't work. Its showing the same error.

Roleren commented 2 years ago

Ok, does the example work for you?

library(ORFik)
gtf_file <- system.file("extdata/Danio_rerio_sample", "annotations.gtf", package = "ORFik")
riboSeq_file <- system.file("extdata/Danio_rerio_sample", "ribo-seq.bam", package = "ORFik")
footprints <- readBam(riboSeq_file)
shifts <- detectRibosomeShifts(footprints, gtf_file)
shiftedReads <- shiftFootprints(footprints, shifts)
mahabubp commented 2 years ago

yes, your example did work for me. I wonder why it doesn't work for my files.

Roleren commented 2 years ago

Ok, print me these 5 things from what you wrote on top (show me consol output), then I should know what it is:

reads
seqinfo(reads)
txdb
seqinfo(txdb)
sessionInfo()
mahabubp commented 2 years ago

Okay. here it is.

library(ORFik) gtf_file <- "~/Homo_sapiens.GRCh38.107.gtf" riboSeq_file <- "/media/lrge/32305f08-3e34-4bbb-94db-3f17f7760b86/lrge/Pasha/STAR/S2620_7_R1/S2620_7_R1_Aligned.sortedByCoord.out.bam" footprints <- readBam(riboSeq_file) shifts <- detectRibosomeShifts(footprints, gtf_file) Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Error in covRleFromGR(x, weight = weight, ignore.strand = ignore.strand) : Seqlengths of x contains NA values! In addition: Warning message: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. shiftedReads <- shiftFootprints(footprints, shifts) Error in is(shifts, "data.frame") : object 'shifts' not found

Roleren commented 2 years ago

Ok, from what you wrote print me:

footprints
seqinfo(footprints)
txdb <- loadTxdb(gtf_file)
txdb
seqinfo(txdb)
sessionInfo()
mahabubp commented 2 years ago

Here you go..

print(footprints) GAlignments object with 22307236 alignments and 0 metadata columns: seqnames strand cigar qwidth start end width njunc

[1] 1 - 66M 66 14694 14759 66 0 [2] 1 - 2S67M 69 14695 14761 67 0 [3] 1 - 61M 61 16071 16131 61 0 [4] 1 - 61M 61 16071 16131 61 0 [5] 1 - 33M296N29M 62 16278 16635 358 1 ... ... ... ... ... ... ... ... ... [22307232] KI270752.1 - 15M 15 8029 8043 15 0 [22307233] KI270752.1 + 19M1S 20 22153 22171 19 0 [22307234] KI270752.1 + 19M1S 20 22153 22171 19 0 [22307235] KI270512.1 + 2S19M 21 20511 20529 19 0 [22307236] KI270466.1 - 63M 63 1005 1067 63 0 ------- seqinfo: 194 sequences from an unspecified genome print(seqinfo(footprints)) Seqinfo object with 194 sequences from an unspecified genome: seqnames seqlengths isCircular genome 1 248956422 10 133797422 11 135086622 12 133275309 13 114364328 ... ... ... ... KI270539.1 993 KI270385.1 990 KI270423.1 981 KI270392.1 971 KI270394.1 970 print(txdb <- loadTxdb(gtf_file)) Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: ~/Homo_sapiens.GRCh38.107.gtf # Organism: NA # Taxonomy ID: NA # miRBase build ID: NA # Genome: NA # Nb of transcripts: 251121 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2022-11-04 13:41:34 +0100 (Fri, 04 Nov 2022) # GenomicFeatures version at creation time: 1.50.1 # RSQLite version at creation time: 2.2.18 # DBSCHEMAVERSION: 1.2 Warning message: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. print(txdb) TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: ~/Homo_sapiens.GRCh38.107.gtf # Organism: NA # Taxonomy ID: NA # miRBase build ID: NA # Genome: NA # Nb of transcripts: 251121 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2022-11-04 13:41:34 +0100 (Fri, 04 Nov 2022) # GenomicFeatures version at creation time: 1.50.1 # RSQLite version at creation time: 2.2.18 # DBSCHEMAVERSION: 1.2 print(seqinfo(txdb)) Seqinfo object with 47 sequences (1 circular) from an unspecified genome; no seqlengths: seqnames seqlengths isCircular genome 1 2 3 4 5 ... ... ... ... KI270731.1 KI270733.1 KI270734.1 KI270744.1 KI270750.1 print(sessionInfo()) R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=cs_CZ.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=cs_CZ.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=cs_CZ.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] data.table_1.14.4 ORFik_1.17.8 GenomicAlignments_1.34.0 Rsamtools_2.14.0
[5] Biostrings_2.66.0 XVector_0.38.0 SummarizedExperiment_1.28.0 MatrixGenerics_1.10.0
[9] matrixStats_0.62.0 GenomicFeatures_1.50.1 AnnotationDbi_1.60.0 Biobase_2.58.0
[13] GenomicRanges_1.50.0 GenomeInfoDb_1.34.0 IRanges_2.32.0 S4Vectors_0.36.0
[17] BiocGenerics_0.44.0

loaded via a namespace (and not attached): [1] biomartr_1.0.2 bitops_1.0-7 bit64_4.0.5 filelock_1.0.2 RColorBrewer_1.1-3
[6] progress_1.2.2 httr_1.4.4 tools_4.2.1 utf8_1.2.2 R6_2.5.1
[11] DBI_1.1.3.9003 colorspace_2.0-3 gridExtra_2.3 tidyselect_1.2.0 prettyunits_1.1.1
[16] DESeq2_1.38.0 bit_4.0.4 curl_4.3.3 compiler_4.2.1 cli_3.4.1
[21] xml2_1.3.3 DelayedArray_0.24.0 fstcore_0.9.12 rtracklayer_1.58.0 scales_1.2.1
[26] genefilter_1.80.0 rappdirs_0.3.3 stringr_1.4.1 digest_0.6.30 R.utils_2.12.1
[31] pkgconfig_2.0.3 fst_0.9.8 dbplyr_2.2.1 fastmap_1.1.0 BSgenome_1.66.0
[36] rlang_1.0.6 rstudioapi_0.14 RSQLite_2.2.18 BiocIO_1.8.0 generics_0.1.3
[41] jsonlite_1.8.3 BiocParallel_1.32.0 dplyr_1.0.10 R.oo_1.25.0 RCurl_1.98-1.9
[46] magrittr_2.0.3 GenomeInfoDbData_1.2.9 Matrix_1.5-1 Rcpp_1.0.9 munsell_0.5.0
[51] fansi_1.0.3 lifecycle_1.0.3 R.methodsS3_1.8.2 stringi_1.7.8 yaml_2.3.6
[56] zlibbioc_1.44.0 BiocFileCache_2.6.0 grid_4.2.1 blob_1.2.3 parallel_4.2.1
[61] crayon_1.5.2 lattice_0.20-45 cowplot_1.1.1 splines_4.2.1 annotate_1.76.0
[66] hms_1.1.2 KEGGREST_1.38.0 locfit_1.5-9.6 pillar_1.8.1 rjson_0.2.21
[71] geneplotter_1.76.0 codetools_0.2-18 biomaRt_2.54.0 XML_3.99-0.12 glue_1.6.2
[76] png_0.1-7 vctrs_0.5.0 gtable_0.3.1 assertthat_0.2.1 cachem_1.0.6
[81] ggplot2_3.3.6 xtable_1.8-4 restfulr_0.0.15 survival_3.4-0 tibble_3.1.8
[86] memoise_2.0.1 ellipsis_0.3.2

Roleren commented 2 years ago

What is going on with your seqinfo values, there should be NA values for seqlength of txdb Not blanks as you have, here is how I print it:

> seqinfo(footprints)
Seqinfo object with 1133 sequences from an unspecified genome:
  seqnames   seqlengths isCircular genome
  MT              16596       <NA>   <NA>
  Zv9_NA1         22128       <NA>   <NA>
  Zv9_NA10        48062       <NA>   <NA>
  Zv9_NA100       26026       <NA>   <NA>
  Zv9_NA1000        942       <NA>   <NA>
  ...               ...        ...    ...
  chr5         75682077       <NA>   <NA>
  chr6         59938731       <NA>   <NA>
  chr7         77276063       <NA>   <NA>
  chr8         56184765       <NA>   <NA>
  chr9         58232459       <NA>   <NA>

Do you not get values printed ?

mahabubp commented 2 years ago

print(seqinfo(footprints)) Seqinfo object with 194 sequences from an unspecified genome: seqnames seqlengths isCircular genome 1 248956422 10 133797422 11 135086622 12 133275309 13 114364328 ... ... ... ... KI270539.1 993 KI270385.1 990 KI270423.1 981 KI270392.1 971 KI270394.1 970

Roleren commented 2 years ago

you get same as me here ?

> footprints@seqinfo@is_circular[1]
[1] NA
mahabubp commented 2 years ago

yes

Roleren commented 2 years ago

And this?

> class(seqinfo(txdb)@seqlengths[1])
[1] "integer"
mahabubp commented 2 years ago

yes

Roleren commented 2 years ago

And this does not crash for you too ?

> seqinfo(footprints)[names(seqinfo(txdb))]
Seqinfo object with 2 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  chr8       56184765         NA   <NA>
  chr24      43947580         NA   <NA>
mahabubp commented 2 years ago

yes, it doesn't. I see the similar pattern as yours.

Roleren commented 2 years ago

Ok, then this is very mysterious, we need to debug: Run this:

debug(ORFik:::covRleFromGR)
shifts <- detectRibosomeShifts(footprints, txdb)

And when have entered covRleFromGr, print me directly:

seqinfo(x)
Seqinfo object with 2 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  chr24      43947580         NA   <NA>
  chr8       56184765         NA   <NA>

For you then there is most likely some NA in there for seqlengths

mahabubp commented 2 years ago

seqinfo(x) Seqinfo object with 47 sequences (1 circular) from an unspecified genome: seqnames seqlengths isCircular genome 1 248956422 10 133797422 11 135086622 12 133275309 13 114364328 ... ... ... ... KI270442.1 KI270726.1 KI270733.1 KI270744.1 KI270750.1

mahabubp commented 2 years ago

Yes, before there were some values for the following, now they disappeared, showing only NA for them KI270442.1 KI270726.1 KI270733.1 KI270744.1 KI270750.1

Roleren commented 2 years ago

Aha, we found a bug in GenomicFeatures package then, of bioconductor core, this should be reported, I will see if I have time. Will write up a temp fix for you now, just have to check it works

mahabubp commented 2 years ago

Hmm okay. Thanks a ton for your time and help.. Hope temp fix would work.

Roleren commented 2 years ago

Try this:

undebug(ORFik:::covRleFromGR)
# Now let's make a txdb with valid seqinfo, hopefully footprints have all seqlevels we need
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file, organism = "Homo sapiens",
                                         chrominfo = seqinfo(footprints))
shifts <- detectRibosomeShifts(footprints, txdb)

Let me know if txdb step does not work, then we need to get the 100% correct seqinfo from the fasta index of the genome itself. Update! Remember to use txdb in shift function, not the original gtf!

mahabubp commented 2 years ago

Yes, it worked. Thanks a ton.. Please let me know if I should close the issue?

Roleren commented 2 years ago

Yes, good that it worked. This error is a bit dangerous and should be fixed, but at least we have a work around. You can now close this issue :)