Roleren / ORFik

MIT License
32 stars 9 forks source link

Problem with computeFeatures #144

Closed JC-therea closed 1 year ago

JC-therea commented 1 year ago

Hello Hakon,

I am working right now with the human genome and for the site I was able to manage Orfik however I am having troubles with the function computeFeatures().

I am using custom gtf and fasta files to do the analysis that I did with other organisms. Here is the code that I used:

# Load files
gtf <- "Human.gtf"
bam_file <- "Test.bam"
genome <- readDNAStringSet("GRCh37.primary_assembly.genome.fa")
names(genome) <- gsub(pattern = " .*", replacement = "",x = names(genome))

# Estimate the psite
footprints <- readBam(bam_file)
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf, chrominfo = seqinfo(footprints))
cds <- loadRegion(txdb, part = "cds")
shifts <- detectRibosomeShifts(footprints, txdb, verbose = F ,must.be.periodic = T, accepted.lengths = 25:35)
shiftedFootprints <- shiftFootprints(footprints, shifts)
shiftedFootprintsCollapsed <- collapseDuplicatedReads(shiftedFootprints, addSizeColumn = TRUE)

# Test computeFeatures
fiveUTRs <- loadRegion(txdb, "leaders")
fiveUTRs <- fiveUTRs[10:20]
faFile <- genome
tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)
fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs, groupByTx = FALSE)
dt <- computeFeatures(grl = fiveUTRs, RFP = shiftedFootprintsCollapsed, RNA = NULL, Gtf = txdb, faFile =  genome, sequenceFeatures = FALSE, weight.RFP = shiftedFootprintsCollapsed$score)

And the error is the following:

No RNA added, skipping feature te and fpkm of RNA, also ribosomeReleaseScore will also be not normalized best way possible.
Error: subscript contains invalid names

I checked all the inputs and seems to be fine the only thing that I found was a big number of warnings loading the gtf file:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "transcript_id" attribute are not unique
3: In .merge_transcript_parts(transcripts) :
  The following transcripts were dropped because they have multiple parts that could not be merged due to incompatible seqnames: ENST00000244174.10_4, ENST00000262640.11_1, ENST00000286448.12_2,
  ENST00000313871.9_2, ENST00000326153.9_3, ENST00000331035.10_3, ENST00000334060.8_1, ENST00000334651.11_4, ENST00000340131.12_1, ENST00000355432.8_2, ENST00000355805.7_1, ENST00000359512.8_1,
  ENST00000369423.7_4, ENST00000381177.7_4, ENST00000381180.9_3, ENST00000381184.6_3, ENST00000381187.8_1, ENST00000381192.10_3, ENST00000381218.8_3, ENST00000381222.8_4, ENST00000381223.9_3, ENST00000381229.9_2,
  ENST00000381233.8_1, ENST00000381241.9_3, ENST00000381261.8_1, ENST00000381297.10_4, ENST00000381317.9_5, ENST00000381333.9_4, ENST00000381401.11_4, ENST00000381469.7_2, ENST00000381500.6_2,
  ENST00000381509.8_2, ENST00000381524.8_3, ENST00000381529.8_3, ENST00000381566.6_3, ENST00000381567.8_3, ENST00000381575.6_1, ENST00000381578.6_3, ENST00000381625.9_6, ENST00000381657.8_6, ENST00000381663. [... truncated]
4: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could not be inferred (either because the exons are not on the same chromosome/strand or because they are not separated by introns):
  ENST00000244174.10_4, ENST00000262640.11_1, ENST00000286448.12_2, ENST00000313871.9_2, ENST00000326153.9_3, ENST00000331035.10_3, ENST00000334060.8_1, ENST00000334651.11_4, ENST00000340131.12_1,
  ENST00000355432.8_2, ENST00000355805.7_1, ENST00000359512.8_1, ENST00000369423.7_4, ENST00000381177.7_4, ENST00000381180.9_3, ENST00000381184.6_3, ENST00000381187.8_1, ENST00000381192.10_3, ENST00000381218.8_3,
  ENST00000381222.8_4, ENST00000381223.9_3, ENST00000381229.9_2, ENST00000381233.8_1, ENST00000381241.9_3, ENST00000381261.8_1, ENST00000381297.10_4, ENST00000381317.9_5, ENST00000381333.9_4,
  ENST00000381401.11_4, ENST00000381469.7_2, ENST00000381500.6_2, ENST00000381509.8_2, ENST00000381524.8_3, ENST00000381529.8_3, ENST00000381566.6_3, ENST00000381567.8_3, ENST00000381575.6_1, ENST00000 [... truncated]
5: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because no genomic ranges could be found for them and their ranges could not be inferred from their exons either (because they have them on more than one chromosome):
  ENST00000244174.10_4, ENST00000262640.11_1, ENST00000286448.12_2, ENST00000313871.9_2, ENST00000326153.9_3, ENST00000331035.10_3, ENST00000334060.8_1, ENST00000334651.11_4, ENST00000340131.12_1,
  ENST00000355432.8_2, ENST00000355805.7_1, ENST00000359512.8_1, ENST00000369423.7_4, ENST00000381177.7_4, ENST00000381180.9_3, ENST00000381184.6_3, ENST00000381187.8_1, ENST00000381192.10_3, ENST00000381218.8_3,
  ENST00000381222.8_4, ENST00000381223.9_3, ENST00000381229.9_2, ENST00000381233.8_1, ENST00000381241.9_3, ENST00000381261.8_1, ENST00000381297.10_4, ENST00000381317.9_5, ENST00000381333.9_4,
  ENST00000381401.11_4, ENST00000381469.7_2, ENST00000381500.6_2, ENST00000381509.8_2, ENST00000381524.8_3, ENST00000381529.8_3, ENST00000381566.6_3, ENST00000381567.8_3, ENST00000381575.6_1,  [... truncated]

I ran the same pipeline before with another genome version and worked... So now I am a little bit lost.

Any help that you could provide me would be perfect!

Extra info in case you need it:

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS/LAPACK: /soft/system/software/OpenBLAS/0.3.12-GCC-10.2.0/lib/libopenblas_prescottp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.14.2           GenomicFeatures_1.46.5      AnnotationDbi_1.56.2        ORFik_1.21.0                GenomicAlignments_1.30.0    Rsamtools_2.10.0            Biostrings_2.62.0           XVector_0.34.0             
 [9] SummarizedExperiment_1.24.0 Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.1        GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.3           
[17] BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] biomartr_0.9.2         bitops_1.0-7           bit64_4.0.5            filelock_1.0.2         RColorBrewer_1.1-2     progress_1.2.2         httr_1.4.2             tools_4.1.2            utf8_1.2.2             R6_2.5.1              
[11] DBI_1.1.2              colorspace_2.0-2       gridExtra_2.3          tidyselect_1.2.0       prettyunits_1.1.1      DESeq2_1.34.0          bit_4.0.4              curl_4.3.2             compiler_4.1.2         cli_3.6.1             
[21] xml2_1.3.3             DelayedArray_0.20.0    rtracklayer_1.54.0     fstcore_0.9.14         scales_1.2.1           genefilter_1.76.0      rappdirs_0.3.3         stringr_1.4.0          digest_0.6.29          R.utils_2.11.0        
[31] pkgconfig_2.0.3        fst_0.9.8              dbplyr_2.1.1           fastmap_1.1.0          BSgenome_1.62.0        rlang_1.1.1            rstudioapi_0.13        RSQLite_2.2.9          BiocIO_1.4.0           generics_0.1.3        
[41] jsonlite_1.7.2         BiocParallel_1.28.3    dplyr_1.1.2            R.oo_1.24.0            RCurl_1.98-1.5         magrittr_2.0.1         GenomeInfoDbData_1.2.7 Matrix_1.4-1           Rcpp_1.0.8             munsell_0.5.0         
[51] fansi_1.0.2            lifecycle_1.0.3        R.methodsS3_1.8.1      stringi_1.7.6          yaml_2.2.1             zlibbioc_1.40.0        BiocFileCache_2.2.0    grid_4.1.2             blob_1.2.2             parallel_4.1.2        
[61] crayon_1.4.2           lattice_0.20-45        cowplot_1.1.1          splines_4.1.2          annotate_1.72.0        hms_1.1.1              KEGGREST_1.34.0        locfit_1.5-9.4         pillar_1.9.0           rjson_0.2.21          
[71] geneplotter_1.72.0     biomaRt_2.50.2         XML_3.99-0.8           glue_1.6.0             png_0.1-7              vctrs_0.6.2            gtable_0.3.0           assertthat_0.2.1       cachem_1.0.6           ggplot2_3.4.0         
[81] xtable_1.8-4           restfulr_0.0.13        survival_3.2-13        tibble_3.2.1           memoise_2.0.1          ellipsis_0.3.2      
Roleren commented 1 year ago

Always make txdb with seqinfo from fasta , not gtf or bam. Bam often have missing chromosomes, this is what happens here I believe :)

JC-therea commented 1 year ago

Hello again. We have realized that we made mistakes in our annotations and genomes, and the problem is now fixed. However, we implemented the recommendation that you said.

Thank you for your help!