Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

error forgebsgenome #35

Closed balathumma10 closed 1 year ago

balathumma10 commented 1 year ago

Hi, I am trying to forge a BSgenome data package for the Eucalyptus grandis genome provided by JGI. When I run forgeBSgenomeDataPkg with the seed file I prepared, I get the following error. Could you please help me solve this? Thanks.

Error in .make_Seqinfo_from_genome(genome) :
  "v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

Error in .make_Seqinfo_from_genome(genome) :
  "Egrandis_297_v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)
In addition: Warning messages:
1: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, :
  field 'provider_version' is deprecated in favor of 'genome'
2: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, :
  field 'release_name' is deprecated

balathumma10 commented 1 year ago

Here is my seed file:

Package: BSgenome.Egrandis.JGI.v2.0
Title: Full genome sequence for Eucalyptus grandis
Description: Full genome sequence of Eucalyptus grandis v2.0 provided by JGI
Version: 2.0
organism: Eucalyptus grandis
common_name: rose gum
provider: phytozome
genome: Eucalyptus grandis v2.0
release_date: 2014/12
release_name: JGI
source_url: https://phytozome-next.jgi.doe.gov/info/Egrandis_v2_0
BSgenomeObjname: Egrandis
seqnames: 1:11
seqfiles_prefix: Egv2.0_chr
seqfiles_suffix: .fasta
seqs_srcdir: /media/bala/Data21/crispr/seqs_srcdir

balathumma10 commented 1 year ago

I have split the genome FASTA file into individual chromosome- and scaffold-level FASTA files. I still get this error.

Error in .make_Seqinfo_from_genome(genome) :
  "Eucalyptus grandis v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)
In addition: Warning message:
In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, :
  field 'release_name' is deprecated

hpages commented 1 year ago

Hi @balathumma10,

Unfortunately, we cannot register assemblies from JGI in the GenomeInfoDb package at the moment. In any case, registration is not required for forging a BSgenome data package. For assemblies not registered in GenomeInfoDb, you need to do one of the following:

  1. List the chromosome/scaffold names in the seqnames: field (like you did).
  2. Convert the big FASTA file containing the entire assembly (Egrandis_297_v2.0.fa.gz) to the 2-bit format and use the seqfile_name: field to specify the name of the .2bit file. When using a .2bit file, option 1 is no longer needed. Note that using a .2bit file will also result in a BSgenome data package that is slightly smaller and more performant. Converting from FASTA to .2bit is generally easy. See issue #26 for some guidance.
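As a rough sketch of option 2, the sequence-related fields of the seed file could look something like the fragment below. The .2bit file name is an assumption (it depends on how the conversion was run), and the directory is simply the one from the seed file posted earlier in this thread. The lines starting with `#` are annotations for this sketch only and are best left out of a real seed file.

```
# Directory containing the .2bit file (taken from the seed file posted above)
seqs_srcdir: /media/bala/Data21/crispr/seqs_srcdir
# Assumed name of the converted file; replaces seqnames:, seqfiles_prefix: and seqfiles_suffix:
seqfile_name: Egrandis_297_v2.0.2bit
```

With this layout, the per-chromosome FASTA files are not needed.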

Finally, whether you choose 1. or 2., you also need to specify which sequences are circular in the circ_seqs: field. Note that this is mandatory, even if there are no circular chromosomes (in which case circ_seqs should be set to character(0)).
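For illustration, the extra line for an assembly with no circular sequences would look like the first entry below. The second, commented-out form is a hypothetical example for an assembly that does include circular sequences; the names "ChrC" and "ChrM" are made up and would have to match sequence names actually present in the assembly.

```
circ_seqs: character(0)
# Hypothetical alternative if the assembly contained circular sequences:
# circ_seqs: c("ChrC", "ChrM")
```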

Hope this helps, H.

balathumma10 commented 1 year ago

Thank you. I used a version of the genome available from NCBI and converted it to .2bit format. I am now able to generate the BSgenome data package using the forgeBSgenome package.

hpages commented 1 year ago

Sounds good. Just to be clear, there's no "forgeBSgenome package". I guess you meant you used the forgeBSgenomeDataPkg() function from the BSgenome package.

balathumma10 commented 1 year ago

Yes, my bad. I meant the forgeBSgenomeDataPkg() function of the BSgenome package.