Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Can't find genome in NCBI and UCSC #74

Closed jang1563 closed 3 months ago

jang1563 commented 5 months ago

Hi,

I'm trying to forge a BSgenome package for Saccharomyces cerevisiae BY4743, downloaded from here (https://www.atcc.org/products/201390). The issue is that this strain is not registered in either NCBI or UCSC.

So I encountered this error:

Error in .make_Seqinfo_from_genome(genome): "Scer-BY4743" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

Is there any suggestion to solve this issue?

hpages commented 5 months ago

This issue should go away if you specify the seqnames and circ_seqs fields in your seed file, so make sure both are specified.

hpages commented 3 months ago

Were you able to sort this out @jang1563?

jang1563 commented 3 months ago

@hpages I still get an error due to the missing 'genome' field.

Here is my seed file.

'mySeedfile'

Package: BSgenome.Scerevisiae.BY4743.2nd
Title: Full genome sequence for Saccharomyces Cerevisiae BY4743
Description: Full genome sequence for Saccharomyces Cerevisiae BY4743
Version: 1.0.0
BSgenomeObjname: Scerevisiae
seqnames: c("scaffold_1", "scaffold_2", "scaffold_3", "scaffold_4", "scaffold_5", "scaffold_6", "scaffold_7", "scaffold_8", "scaffold_9", "scaffold_10", "scaffold_11", "scaffold_12", "scaffold_13", "scaffold_14", "scaffold_15", "scaffold_16", "scaffold_17", "scaffold_18", "scaffold_19", "scaffold_20", "scaffold_21", "scaffold_22", "scaffold_23", "scaffold_24", "scaffold_25")
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split

With this seed file, I tried to run the following code.

forgeBSgenomeDataPkg("mySeedfile")

However, I encountered this error:

Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, : 'genome' field is missing in seed file

Do you have any suggestions?

hpages commented 2 months ago

The error message is pretty clear in this case: it tells you that the genome field is missing in your seed file. So my suggestion is that you add it :wink:

Please refer to the How to forge a BSgenome data package vignette in the BSgenome package for more information.

Best

jang1563 commented 2 months ago

The manual says the genome field is not necessary, but forging failed because the genome field was missing. This point is confusing to me. As I mentioned in my first question, this reference is custom, so there is no matching genome in NCBI or UCSC.

hpages commented 2 months ago

The manual says the genome field is not necessary

Not sure what manual you are looking at but that is not what it says. Here is the vignette you want to consult: https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

In the vignette, fields that are not necessary are marked with a big upper case OPTIONAL e.g.:

For genome we have:

So yes, the genome field is required.

This point is confusing to me. As I mentioned in the 1st question, this reference is custom so there is no matching genome in NCBI or UCSC.

Just give your genome a name (note that the name should not contain spaces or special characters other than .). Who says it has to match a genome in NCBI or UCSC? "Typically" in "Typically the name of an NCBI assembly or UCSC genome" doesn't mean "it must be".
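For illustration only, a seed file along these lines might look like the sketch below. It reuses values from this thread; the genome name Scer.BY4743 is a made-up label (not an NCBI or UCSC name), paste0("scaffold_", 1:25) is shorthand for the 25 scaffold names listed earlier, and the organism, common_name, provider, release_date and source_url values are placeholders. The exact set of required fields should be checked against the BSgenomeForge vignette for the installed BSgenome version.

```
Package: BSgenome.Scerevisiae.BY4743.2nd
Title: Full genome sequence for Saccharomyces cerevisiae BY4743
Description: Full genome sequence for Saccharomyces cerevisiae BY4743
Version: 1.0.0
organism: Saccharomyces cerevisiae
common_name: Yeast
genome: Scer.BY4743
provider: JK
release_date: April, 2024
source_url: NA
organism_biocview: Saccharomyces_cerevisiae
BSgenomeObjname: Scerevisiae
seqnames: paste0("scaffold_", 1:25)
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split
```

Because seqnames and circ_seqs are given explicitly, the genome value should not need to be looked up among the registered NCBI assemblies or UCSC genomes.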

jang1563 commented 2 months ago

I solved this issue with this seed file.

In this case, it worked well without the 'genome' field.

Package: BSgenome.Scerevisiae.BY4743.04.11.2024.ver3
Title: Full genome sequence for Saccharomyces Cerevisiae BY4743
Description: Full genome sequence for Saccharomyces Cerevisiae BY4743
Version: 1.0.0
organism: Yeast
common_name: Yeast
organism_biocview: Yeast
provider: JK
provider_version: ONT
release_date: April, 2024
release_name: Scerevisiae.BY4743
source_url: NA
BSgenomeObjname: Scerevisiae
seqnames: c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25')
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split/chr