PoolLab / ReferenceEnhancer


OverlapResolutions Error with mm39 ensembl GTF #7

Open jkniehaus opened 7 months ago

jkniehaus commented 7 months ago

Hello,

Thanks for the tool. I successfully went through your test files.

I'm trying to generate an optimized annotation from Ensembl's latest mm39 GTF and am running into an error in the OverlapResolutions function:

Error in if (overlap_data[item, "number_of_gene_overlaps"] > 1) { : 
  missing value where TRUE/FALSE needed
Calls: OverlapResolutions

Do GTF files need to be processed or formatted in any way before loading? I'm guessing this error comes from an NA in the overlap table. Any guidance is appreciated (or if you have an optimized mm39 GTF readily available, that'd be great too).

Thanks! Jesse

Code below:

library(ReferenceEnhancer)
genome_annotation <- LoadGtf(unoptimized_annotation_path = "Mus_musculus.GRCm39.111.gtf")
gene_overlaps <- IdentifyOverlappers(genome_annotation = genome_annotation)
OverlapResolutions(genome_annotation = genome_annotation, overlap_data = gene_overlaps, gene_pattern = c("Rik$", "^Gm"))

sessionInfo():

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS/LAPACK: /nas/longleaf/rhel8/apps/r/4.3.1/lib/libopenblas_zenp-r0.3.23.so; LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.3.1

jkniehaus commented 7 months ago

Looks like NAs were being created in the 'genes' and 'overlapping_genes' columns. The function worked after removing these rows from the 'overlapping_gene_list' object.
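For anyone hitting the same message, here is a minimal base-R sketch of why an NA produces it and how the rows can be dropped. The toy table and its values are invented for illustration. Only the column names 'genes' and 'number_of_gene_overlaps' come from this thread, and the real object returned by IdentifyOverlappers may be laid out differently.

```r
# Toy stand-in for the overlap table (values are made up for illustration).
overlap_data <- data.frame(
  genes = c("Gm1", NA, "Xkr4"),
  number_of_gene_overlaps = c(2, NA, 1),
  stringsAsFactors = FALSE
)

# NA > 1 evaluates to NA, and if (NA) raises
# "missing value where TRUE/FALSE needed".
msg <- tryCatch(
  {
    if (overlap_data[2, "number_of_gene_overlaps"] > 1) "multi" else "single"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Keep only rows without NA in the columns the loop depends on.
keep <- !is.na(overlap_data$genes) &
  !is.na(overlap_data$number_of_gene_overlaps)
overlap_clean <- overlap_data[keep, , drop = FALSE]
print(nrow(overlap_clean))
```

Filtering this way hides the symptom. It does not explain where the NAs come from. In an Ensembl GTF, one plausible source is genes that have a gene_id but no gene_name attribute.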

jkniehaus commented 3 weeks ago

Removing NAs resolved the first problem, but a second error persists in the OverlapResolutions function. Perhaps ReferenceEnhancer requires the GTF to be filtered to some degree. Using one downloaded directly from Ensembl does not seem to work.

library(ReferenceEnhancer)
library(dplyr)

genome_annotation <- LoadGtf(unoptimized_annotation_path = "Mus_musculus.GRCm39.111.gtf")
genome_annotation <- genome_annotation %>%
    mutate(gene_name = coalesce(gene_name, gene_id)) #remove NAs and replace w/ gene_id
gene_overlaps <- IdentifyOverlappers(genome_annotation = genome_annotation)
OverlapResolutions(genome_annotation = genome_annotation, overlap_data = gene_overlaps, gene_pattern = c("Rik$", "^Gm"))
Error in seq.default(from = gene_A_exons[row_exonA, 1], to = gene_A_exons[row_exonA,  : 
  wrong sign in 'by' argument
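
As a starting point for debugging, this base-R sketch reproduces the message. seq() raises it when an explicit 'by' points away from 'to', for example a positive step with 'from' greater than 'to'. Whether that is what happens inside OverlapResolutions is an assumption: the call in the traceback is truncated, and the exon table with its 'start' and 'end' columns is hypothetical.

```r
# seq() with an explicit positive step fails when from > to.
msg <- tryCatch(
  seq(from = 10, to = 5, by = 1),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical exon table; `start` and `end` are assumed column names.
exons <- data.frame(
  gene_name = c("A", "A", "B"),
  start = c(100, 300, 900),
  end = c(200, 250, 950),
  stringsAsFactors = FALSE
)

# Rows whose start exceeds their end would break a start-to-end seq()
# that uses a positive step, so they are worth listing before the run.
bad <- exons[exons$start > exons$end, , drop = FALSE]
print(bad)
```

If a check like this finds no such rows in the loaded annotation, the mismatch may instead come from how the function indexes exon rows for a particular gene, which would need to be confirmed in the package source.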