Bioconductor / GenomicFeatures

Query the gene models of a given organism/assembly
https://bioconductor.org/packages/GenomicFeatures

Unable to parse bacterial GFF files #30

jambler24 closed this issue 2 years ago

jambler24 commented 3 years ago

When I try to import a GFF file for Mycobacterium tuberculosis from NCBI, I get the following error:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  The following transcripts have multiple parts that were merged:
  gene-Rv3216

Code:

txdb <- makeTxDbFromGFF("/path/to/file/GCA_000195955.2_ASM19595v2_genomic.gff", organism="Mycobacterium tuberculosis")

Link to annotation and genome:

https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2/

hpages commented 3 years ago

I only see a warning here. Is that what is bothering you?

Other than that, it seems to be working just fine:

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("GCA_000195955.2_ASM19595v2_genomic.gff.gz")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
# In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   The following transcripts have multiple parts that were merged:
#   gene-Rv3216

txdb
# TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: GCA_000195955.2_ASM19595v2_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 4111
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-02-04 10:25:43 -0800 (Thu, 04 Feb 2021)
# GenomicFeatures version at creation time: 1.42.1
# RSQLite version at creation time: 2.2.3
# DBSCHEMAVERSION: 1.2

head(transcriptLengths(txdb))
#   tx_id tx_name gene_id nexon tx_len
# 1     1    dnaA    dnaA     1   1524
# 2     2    dnaN    dnaN     1   1209
# 3     3    recF    recF     1   1158
# 4     4  Rv0004  Rv0004     1    564
# 5     5    gyrB    gyrB     1   2028
# 6     6    gyrA    gyrA     1   2517

If it didn't work for you, please share the details.

Thanks, H.

sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.3/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.42.1 AnnotationDbi_1.52.0   Biobase_2.50.0        
[4] GenomicRanges_1.42.0   GenomeInfoDb_1.26.2    IRanges_2.24.1        
[7] S4Vectors_0.28.1       BiocGenerics_0.36.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41            
 [3] prettyunits_1.1.1           Rsamtools_2.6.0            
 [5] Biostrings_2.58.0           assertthat_0.2.1           
 [7] BiocFileCache_1.14.0        R6_2.5.0                   
 [9] RSQLite_2.2.3               httr_1.4.2                 
[11] pillar_1.4.7                zlibbioc_1.36.0            
[13] rlang_0.4.10                progress_1.2.2             
[15] curl_4.3                    rstudioapi_0.13            
[17] blob_1.2.1                  Matrix_1.3-2               
[19] BiocParallel_1.24.1         stringr_1.4.0              
[21] RCurl_1.98-1.2              bit_4.0.4                  
[23] biomaRt_2.46.2              DelayedArray_0.16.1        
[25] compiler_4.0.3              rtracklayer_1.50.0         
[27] pkgconfig_2.0.3             askpass_1.1                
[29] openssl_1.4.3               tidyselect_1.1.0           
[31] SummarizedExperiment_1.20.0 tibble_3.0.6               
[33] GenomeInfoDbData_1.2.4      matrixStats_0.58.0         
[35] XML_3.99-0.5                crayon_1.4.0               
[37] dplyr_1.0.4                 dbplyr_2.1.0               
[39] GenomicAlignments_1.26.0    bitops_1.0-6               
[41] rappdirs_0.3.3              grid_4.0.3                 
[43] lifecycle_0.2.0             DBI_1.1.1                  
[45] magrittr_2.0.1              stringi_1.5.3              
[47] cachem_1.0.2                XVector_0.30.0             
[49] xml2_1.3.2                  ellipsis_0.3.1             
[51] generics_0.1.0              vctrs_0.3.6                
[53] tools_4.0.3                 bit64_4.0.5                
[55] glue_1.4.2                  purrr_0.3.4                
[57] hms_1.0.0                   MatrixGenerics_1.2.1       
[59] fastmap_1.1.0               memoise_2.0.0
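
For anyone who wants to see which records are behind the "multiple parts that were merged" warning above, here is a minimal base-R sketch. It assumes the warning is triggered when several GFF lines share the same ID attribute, which is my reading of the message rather than something confirmed in this thread. The GFF lines below are toy data with made-up IDs and coordinates, not taken from the M. tuberculosis file, and the names gff_lines, gff and multi_part_ids are only illustrative.

```r
# Toy GFF3 records (hypothetical IDs and coordinates, NOT from the real file).
# "gene-B" is deliberately listed on two lines to mimic a multi-part feature.
gff_lines <- c(
  "chr1\ttoy\tgene\t100\t200\t.\t+\t.\tID=gene-A;Name=A",
  "chr1\ttoy\tgene\t300\t400\t.\t+\t.\tID=gene-B;Name=B",
  "chr1\ttoy\tgene\t450\t500\t.\t+\t.\tID=gene-B;Name=B",
  "chr1\ttoy\tgene\t600\t700\t.\t-\t.\tID=gene-C;Name=C"
)

# Read the nine standard GFF3 columns; '#' lines are skipped as comments.
gff <- read.delim(
  text = gff_lines, header = FALSE, comment.char = "#", quote = "",
  stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

# Extract the value of the ID attribute (NA when a record has no ID).
has_id <- grepl("(^|;)ID=", gff$attributes)
gff$id <- ifelse(has_id,
                 sub("^(.*;)?ID=([^;]+).*$", "\\2", gff$attributes),
                 NA_character_)

# IDs that occur on more than one line are the multi-part candidates.
id_counts <- table(gff$id)
multi_part_ids <- names(id_counts)[id_counts > 1]

# Show the records that share an ID.
print(gff[gff$id %in% multi_part_ids, c("type", "start", "end", "strand", "id")])
```

To try this on a real annotation, replace the text = gff_lines argument with the path to the GFF file. If gene-Rv3216 shows up in multi_part_ids, the individual lines can then be compared with what ends up in the TxDb.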