Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

Please help adding Xenopus tropicalis to GenomeInfoDb package #101

Closed clarkzor closed 10 months ago

clarkzor commented 10 months ago

Hello, I am trying to use forgeBSgenomeDataPkg("../Xenopustropicalis_Seedfile.dcf") to create a new package from a seed file, but I receive this error message:

Error in .make_Seqinfo_from_genome(genome) : "xenTro" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

This is what my seed file looks like:

Package: BSgenome.Xtrop.NCBI.UCB_Xtro_10.0
Title: Full genomic sequences for Xenopus Tropicalis VIA NCBI
Description: For Multiomic Analysis
Version: 1.0.0
organism: Xenopus tropicalis
common_name: Frog
provider: NCBI
genome: UCB_Xtro_10.0
release_date: Nov. 2019
source_url: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000004195.4/
organism_biocview: Xenopus_tropicalis
BSgenomeObjname: Xtrop
seqs_srcdir: /Users/coron/OneDrive/Desktop/10xGenomics/
seqfile_name: UCB_Xtro_10.0_genome.fasta

Could you please add the Xenopus tropicalis genome to the GenomeInfoDb package so that I can use this function?

Also, should I be able to run this with a single genomic FASTA file that contains all the chromosomes, or do I have to split the chromosomes into individual FASTA files?

Thank you for your help.

hpages commented 10 months ago

This is a duplicate of https://github.com/Bioconductor/BSgenome/issues/75

BTW your seed file does not contain xenTro so I'm not sure how you get this error about xenTro not being registered in GenomeInfoDb. Note that there's actually no assembly named xenTro at NCBI (see https://www.ncbi.nlm.nih.gov/assembly/?term=xenTro) so that's not something we would be able to register anyway.
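
Since the error names xenTro while the seed file shown lists UCB_Xtro_10.0, one thing that may be worth checking is which genome value R actually reads from the file passed to forgeBSgenomeDataPkg(). The following is only a minimal sketch using base R's read.dcf(); the helper name seed_genome() is hypothetical, and the demo writes a throwaway seed file with two of the fields from this thread rather than reading the real one.

```r
# Hypothetical helper: return the 'genome' field of a DCF seed file.
seed_genome <- function(path) {
  seed <- read.dcf(path)  # character matrix with one column per field
  if (!"genome" %in% colnames(seed))
    stop("no 'genome' field found in ", path)
  unname(seed[1, "genome"])
}

# Demo on a temporary seed file (an assumption, not the poster's actual file).
tmp <- tempfile(fileext = ".dcf")
writeLines(c("Package: BSgenome.Xtrop.NCBI.UCB_Xtro_10.0",
             "genome: UCB_Xtro_10.0"), tmp)
seed_genome(tmp)
```

Running seed_genome() on the path actually given to forgeBSgenomeDataPkg() would show whether that file is the one listed above or an older copy with a different genome field.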

Finally, and FWIW, note that assembly UCB_Xtro_10.0 is already registered in GenomeInfoDb:

> library(GenomeInfoDb)

> registered_NCBI_assemblies("tropicalis")
            organism      assembly       date      extra_info
1 Xenopus tropicalis UCB_Xtro_10.0 2019/11/14 strain:Nigerian
2 Xenopus tropicalis  ASM1336827v1 2020/06/23 strain:Nigerian
  assembly_accession circ_seqs
1    GCF_000004195.4        MT
2    GCA_013368275.1          

> Seqinfo(genome="UCB_Xtro_10.0")
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
  seqnames seqlengths isCircular        genome
  Chr1      217471166      FALSE UCB_Xtro_10.0
  Chr2      181034961      FALSE UCB_Xtro_10.0
  Chr3      153873357      FALSE UCB_Xtro_10.0
  Chr4      153961319      FALSE UCB_Xtro_10.0
  Chr5      164033575      FALSE UCB_Xtro_10.0
  ...             ...        ...           ...
  Sca152          786      FALSE UCB_Xtro_10.0
  Sca153          755      FALSE UCB_Xtro_10.0
  Sca154          748      FALSE UCB_Xtro_10.0
  Sca155          593      FALSE UCB_Xtro_10.0
  Sca156          582      FALSE UCB_Xtro_10.0
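
The lookup in the session above can be wrapped in a small check to run before forging. This is a sketch only: is_registered_ncbi() is a hypothetical helper, it relies on the `assembly` column visible in the registered_NCBI_assemblies() output above, and it does an exact string match, which may be stricter than the lookup GenomeInfoDb performs internally.

```r
# Hypothetical helper: is `genome` listed among the registered NCBI assemblies?
# `assemblies` defaults to the table returned by GenomeInfoDb, but any data
# frame with an 'assembly' column can be supplied instead.
is_registered_ncbi <- function(genome,
                               assemblies = GenomeInfoDb::registered_NCBI_assemblies()) {
  genome %in% assemblies$assembly
}

# Only query the package if it is installed.
if (requireNamespace("GenomeInfoDb", quietly = TRUE)) {
  print(is_registered_ncbi("UCB_Xtro_10.0"))
  print(is_registered_ncbi("xenTro"))
}
```

With the GenomeInfoDb version shown in this thread, the first call would be expected to report the assembly as registered, consistent with the table above.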

Anyway, all this becomes irrelevant if you forge the BSgenome package as suggested here: https://github.com/Bioconductor/BSgenome/issues/75#issuecomment-1912753142

sessionInfo():

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.3.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.3.0/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GenomeInfoDb_1.38.5 IRanges_2.36.0      S4Vectors_0.40.2   
[4] BiocGenerics_0.48.1

loaded via a namespace (and not attached):
[1] compiler_4.3.0          GenomeInfoDbData_1.2.11 RCurl_1.98-1.14        
[4] bitops_1.0-7