Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Error using forgeBSgenomeDataPkg() to forge a package for a bacterial species not included in registered_NCBI_assemblies() or registered_UCSC_genomes() #55

Status: Closed (Guan06 closed this issue 1 year ago)

Guan06 commented 1 year ago

Hi,

I am trying to forge a data package for the bacterium I am working with (Bacteroides uniformis). However, when preparing the seed file, for the "genome" field, none of the genomes close to this species can be found in either registered_NCBI_assemblies() or registered_UCSC_genomes(). Any suggestions on how to get around this error?

Many thanks and looking forward to your suggestions!

Best, Rui

hpages commented 1 year ago

Hi @Guan06,

For an unregistered genome, you must specify the seqnames field in your seed file.

If the genome is registered in GenomeInfoDb, then you don't need to specify the seqnames field because, in this case, forgeBSgenomeDataPkg() will be able to fetch the sequence names for you. So the process of forging a BSgenome data package is just more convenient when the genome is registered in GenomeInfoDb, but it's not a requirement.

Hope this helps, H.

Guan06 commented 1 year ago

Thank you very much for your reply, Herve!

I added the seqnames field, but the error message is still there. Below are the non-standard DESCRIPTION fields of the seed file, which I wrote following section 2.2.3 of the vignette:

organism: Bacteroides uniformis ATCC8492
common_name: Bacteroides uniformis
genome: Bacteroides uniformis
provider: NCBI
release_date: 2022/09/12
source_url: https://www.ncbi.nlm.nih.gov/nuccore/CP102263.1
organism_biocview: Bacteroides_uniformis_ATCC8492
BSgenomeObjname: Buniformis_ATCC8492
seqnames: c("Buni8492")
circ_seqs: character(0)
seqs_srcdir: /rds-d6/user/rg684/hpc-work/bin/Buniformis_BSgenome/
seqfile_name: ATCC8492.2bit
ondisk_seq_format: rda

Inside the folder /rds-d6/user/rg684/hpc-work/bin/Buniformis_BSgenome/ I have the following files: ATCC8492.2bit and Buni8492.fa

Thanks again and best, Rui

hpages commented 1 year ago

> I added the seqnames field, but the error message is still there.

What error message? You never showed it.

I see that you have the following line in your seed file:

ondisk_seq_format: rda

I'm not quite sure what you're trying to achieve with this, but, since you have the genomic sequences in a 2bit file (ATCC8492.2bit), you should not need to specify ondisk_seq_format.

H.

Guan06 commented 1 year ago

Oh, sorry. It is the error saying that 'genome' is not included in registered_NCBI_assemblies() or registered_UCSC_genomes(). However, after removing the 'ondisk_seq_format' field and rerunning everything, the error message was gone.

Thank you again for your help!

Best, Rui
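
For readers who land on this issue with the same error, here is a sketch of what the working seed file probably looked like. It is reconstructed from the fields Rui posted above with the ondisk_seq_format line removed, as Hervé suggested. The final file was never shown in this thread, so treat it as an assumption rather than the exact file that was used. Only the non-standard DESCRIPTION fields are shown, and the paths and names are Rui's own.

```
organism: Bacteroides uniformis ATCC8492
common_name: Bacteroides uniformis
genome: Bacteroides uniformis
provider: NCBI
release_date: 2022/09/12
source_url: https://www.ncbi.nlm.nih.gov/nuccore/CP102263.1
organism_biocview: Bacteroides_uniformis_ATCC8492
BSgenomeObjname: Buniformis_ATCC8492
seqnames: c("Buni8492")
circ_seqs: character(0)
seqs_srcdir: /rds-d6/user/rg684/hpc-work/bin/Buniformis_BSgenome/
seqfile_name: ATCC8492.2bit
```

The two points from the discussion are both visible here: seqnames is given explicitly because the genome is not registered in GenomeInfoDb, and ondisk_seq_format is left out because the sequences are supplied in a 2bit file. The path to this seed file is then passed to forgeBSgenomeDataPkg() as usual.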