Can't find genome in NCBI and UCSC

jang1563 commented 5 months ago

Hi,

I'm trying to forge a package for Saccharomyces cerevisiae BY4743, downloaded from here (https://www.atcc.org/products/201390). The issue is that this strain is not registered in either NCBI or UCSC.

So I encountered this error:

Error in .make_Seqinfo_from_genome(genome): "Scer-BY4743" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

Is there any suggestion to solve this issue?

hpages commented 5 months ago

This issue should go away if you specify the seqnames and circ_seqs fields in your seed file, so make sure both are specified.

hpages commented 3 months ago

Were you able to sort this out @jang1563?

jang1563 commented 3 months ago

@hpages I still get an error due to the missing 'genome' field.

Here is my seed file.

'mySeedfile'

Package: BSgenome.Scerevisiae.BY4743.2nd
Title: Full genome sequence for Saccharomyces Cerevisiae BY4743
Description: Full genome sequence for Saccharomyces Cerevisiae BY4743
Version: 1.0.0
BSgenomeObjname: Scerevisiae
seqnames: c("scaffold_1", "scaffold_2", "scaffold_3", "scaffold_4", "scaffold_5", "scaffold_6", "scaffold_7", "scaffold_8", "scaffold_9", "scaffold_10", "scaffold_11", "scaffold_12", "scaffold_13", "scaffold_14", "scaffold_15", "scaffold_16", "scaffold_17", "scaffold_18", "scaffold_19", "scaffold_20", "scaffold_21", "scaffold_22", "scaffold_23", "scaffold_24", "scaffold_25")
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split

With this seed file, I tried to run the following code.

forgeBSgenomeDataPkg("mySeedfile")

However, I encountered this error:

Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, : 'genome' field is missing in seed file

Do you have any suggestions?

hpages commented 2 months ago

The error message is pretty clear in this case: it tells you that the genome field is missing in your seed file. So my suggestion is that you add it :wink:

Please refer to the How to forge a BSgenome data package vignette in the BSgenome package for more information.

Best

jang1563 commented 2 months ago

The manual says the genome field is not necessary, but forging failed because the genome field was missing. This point is confusing to me. As I mentioned in the 1st question, this reference is custom so there is no matching genome in NCBI or UCSC.

hpages commented 2 months ago

> The manual says the genome field is not necessary

Not sure what manual you are looking at but that is not what it says. Here is the vignette you want to consult: https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

In the vignette, fields that are not necessary are marked with a big upper case OPTIONAL e.g.:

PkgDetail: [OPTIONAL] Some arbitrary text that will be copied to the Details section of the man page of the target package.

For genome we have:

genome: The name of the genome. Typically the name of an NCBI assembly (e.g. GRCh38.p12, WBcel235, TAIR10.1, ARS-UCD1.2, etc...) or UCSC genome (e.g. hg38, bosTau9, galGal6, ce11, etc...). Should preferably match part 4 of the package name (field Package). For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the UCSC VERSION field of the List of UCSC genome releases table.

So yes, the genome field is required.

> This point is confusing to me. As I mentioned in the 1st question, this reference is custom so there is no matching genome in NCBI or UCSC.

Just give your genome a name (note that the name should not contain spaces or special characters other than .). Who says it has to match a genome in NCBI or UCSC? "Typically" in "Typically the name of an NCBI assembly or UCSC genome" doesn't mean "it must be".
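For a custom assembly, the advice above might translate into a seed file along these lines. This is only a sketch under assumptions: the genome value BY4743 is a made-up name chosen for illustration (any name without spaces or special characters other than . should do), and the remaining values are copied from the seed file posted earlier in this thread, with the 25 scaffold names written as an equivalent R expression. Other fields described in the vignette are not shown here and should be filled in as the vignette for your BSgenome version requires.

```
Package: BSgenome.Scerevisiae.BY4743.2nd
Title: Full genome sequence for Saccharomyces Cerevisiae BY4743
Description: Full genome sequence for Saccharomyces Cerevisiae BY4743
Version: 1.0.0
genome: BY4743
BSgenomeObjname: Scerevisiae
seqnames: paste0("scaffold_", 1:25)
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split
```

Because seqnames and circ_seqs are given explicitly, the genome name here is just a label and does not have to be registered with NCBI or UCSC.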

jang1563 commented 2 months ago

I solved this issue with this seed file.

In this case, it worked well without the 'genome' field.

Package: BSgenome.Scerevisiae.BY4743.04.11.2024.ver3
Title: Full genome sequence for Saccharomyces Cerevisiae BY4743
Description: Full genome sequence for Saccharomyces Cerevisiae BY4743
Version: 1.0.0
organism: Yeast
common_name: Yeast
organism_biocview: Yeast
provider: JK
provider_version: ONT
release_date: April, 2024
release_name: Scerevisiae.BY4743
source_url: NA
BSgenomeObjname: Scerevisiae
seqnames: c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25')
circ_seqs: character(0)
seqs_srcdir: /athena/masonlab/scratch/users/jak4013/Artemis/Artemis_I/Fasta/seqs_srcdir/split/chr

Bioconductor / BSgenome

Can't find genome in NCBI and UCSC #74