Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Unable to forge genome for EquCab #29

Closed (prmunn closed this issue 2 years ago)

prmunn commented 2 years ago

I'm having problems forging a genome for EquCab3.0. Apparently it isn't a registered NCBI assembly or UCSC genome in the GenomeInfoDb package, so when I put the genome in the seed file I get this message:

Error in .make_Seqinfo_from_genome(genome) : "equCab3" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

However, when I omit the genome from the seed file and put in seqnames instead, I get the following error message:

Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, : 'genome' field is missing in seed file

I've tried various versions of the seed file, but here is my latest attempt:

Package: BSgenome.EquCab3.UCSC.GCF_002863925.1
Title: Full genome sequence for Equus cabellus 3.0
Description: Full genome sequence for the chromosomes of Equus cabellus 3.0 provided by UCSC
Version: 1.0.0
organism: Equus cabellus 3.0
common_name: horse
genome: GCF_002863925.1
provider: UCSC
release_date: 2018/01
source_url: https://hgdownload.soe.ucsc.edu/goldenPath/equCab3/bigZips/
organism_biocview: Equus_cabellus
BSgenomeObjname: EquCab3
seqnames: c(1:31, "X")
seqs_srcdir: GenomicsInnovation/genomes/Equus_caballus/seqs_srcdir
seqfile_name: Equus_caballus.EquCab3.2bit

prmunn commented 2 years ago

Found the issue. I needed to add circ_seqs: character(0) to the seed file.
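
For anyone hitting the same error, here is a minimal sketch of that fix in R. It assumes the seed file is in the DCF format that base R's read.dcf() can parse. The helper name check_seed_fields and the temporary file are hypothetical, and only the two fields discussed in this thread are checked, not the full set a seed file may need.

```r
# Hypothetical helper: return the fields that are absent from a seed file.
check_seed_fields <- function(seed_path, fields = c("genome", "circ_seqs")) {
  seed <- read.dcf(seed_path)       # base R; character matrix, one column per field
  setdiff(fields, colnames(seed))   # requested fields not found in the file
}

# Throwaway seed file with a few of the values from the seed file above.
seed_path <- tempfile(fileext = "-seed")
writeLines(c(
  "Package: BSgenome.EquCab3.UCSC.GCF_002863925.1",
  "genome: GCF_002863925.1",
  "seqnames: c(1:31, \"X\")"
), seed_path)
missing_before <- check_seed_fields(seed_path)

# Append the line that resolved the issue, then check again.
cat("circ_seqs: character(0)\n", file = seed_path, append = TRUE)
missing_after <- check_seed_fields(seed_path)
```

Running a check like this before calling forgeBSgenomeDataPkg() on the real seed file is optional; it only reports whether the genome and circ_seqs fields are present.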

hpages commented 2 years ago

@prmunn

FWIW I just registered UCSC genomes equCab1, equCab2, and equCab3 in GenomeInfoDb 1.32.1 to make forging those BSgenome data packages easier in the future. This will also allow you to do things like:

Seqinfo(genome="equCab3")

or:

library(BSgenome.Ecaballus.UCSC.equCab3)
genome <- BSgenome.Ecaballus.UCSC.equCab3
seqlevelsStyle(genome) <- "NCBI"

to rename the sequences with the NCBI names.
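
To see whether a genome is registered before forging, something like the sketch below may help. It uses registered_UCSC_genomes(), the function named in the error message above, and assumes the returned data.frame has a `genome` column holding the UCSC genome names. The requireNamespace() guard is only there so the snippet does nothing when GenomeInfoDb is not installed.

```r
# Look up the horse genomes among the UCSC genomes registered in GenomeInfoDb.
if (requireNamespace("GenomeInfoDb", quietly = TRUE)) {
  registered <- GenomeInfoDb::registered_UCSC_genomes()
  # Assumption: the UCSC genome names are stored in the `genome` column.
  horse <- registered[registered$genome %in% c("equCab1", "equCab2", "equCab3"), ]
  print(horse)
}
```

An empty result would suggest the installed GenomeInfoDb predates the registration, in which case the seed file still needs circ_seqs spelled out.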

BTW, following the naming scheme for BSgenome data packages, the package should be named BSgenome.Ecaballus.UCSC.equCab3. Finally, it seems that you forgot the mitochondrial chromosome (chrM).

Best, H.

prmunn commented 2 years ago

Thanks!