Use unpublished in-house genome fasta for creating a BSgenome object

alslonik commented 1 year ago

Is it possible to use an unpublished in-house genome FASTA file to create a BSgenome object? We would like to be able to create a BSgenome object from our own, as yet unpublished, data.

I am trying to do it with the following seed file (Pgranatum.1.fa etc. are in the /home/alex/work/genomepackage/seqs_srcdir folder):

package: BSgenome.Pgranatum.ARO
Title: Full genome sequence for Punica granatum wonderful cultivar
Description: Full genome sequence for Punica granatum wonderful cultivar
Version: 1.0.0
organism: Punica Granatum
common_name: P. granatum
provider: genome
provider_version: genome
source_url:
organism_biocview: Punica_granatum
BSgenomeObjname: Pgranatum
seqnames: c("Pgranatum.1","Pgranatum.2", "Pgranatum.3","Pgranatum.4", "Pgranatum.5", "Pgranatum.6", "Pgranatum.7", "Pgranatum.8", "Pgranatum.9")
seqs_srcdir: /home/alex/work/genomepackage/seqs_srcdir

The error I am getting when trying to forge is:

Error in .make_Seqinfo_from_genome(genome) : 
  "genome" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to
  list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)
In addition: Warning message:
In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  :
  field 'provider_version' is deprecated in favor of 'genome'

hpages commented 1 year ago

Hi,

For genomes not registered in GenomeInfoDb, you must specify the circ_seqs field, even if your genome does not have circular sequences (in which case circ_seqs should be set to character(0)).

Also:

Why provider: genome? Why not put something a little bit more meaningful than "genome" for the name of the provider of your in-house genome? Note that you've named the package BSgenome.Pgranatum.ARO, which means that, if you are following the naming scheme for BSgenome data packages, the 3rd part (ARO) is the name of the provider. So why not set the provider field to that instead?
provider_version: genome: As indicated by the warning you got, the provider_version field is deprecated in favor of the genome field. The value for this field should be the name of your in-house genome or assembly, so hopefully you can come up with a better name than "genome" for your in-house assembly. Think of how you're going to refer to this assembly in group meetings or when you communicate with co-workers. You're not going to refer to it as "genome", are you?
Once you have figured out a good name for this assembly, say PGv1, you should embed that name in the name of the package itself e.g. BSgenome.Pgranatum.ARO.PGv1. The naming scheme for BSgenome data packages is to use a name made of 4 parts, the 4th part being the name of the genome or assembly.
Remember that you must have one FASTA file per sequence name in /home/alex/work/genomepackage/seqs_srcdir and that these files must be named <seqname>.fa.
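Putting the points above together, a revised seed file might look like the sketch below. This is an illustration, not a verified file: PGv1 is only the placeholder assembly name suggested above, ARO is assumed to be the provider because it is the third part of the package name, and the empty source_url field from the original seed file is left out. The fragment contains no comment lines because it is not certain that the seed file parser accepts them.

```
Package: BSgenome.Pgranatum.ARO.PGv1
Title: Full genome sequence for Punica granatum wonderful cultivar (PGv1)
Description: Full genome sequence for Punica granatum wonderful cultivar, in-house assembly PGv1 provided by ARO
Version: 1.0.0
organism: Punica granatum
common_name: P. granatum
genome: PGv1
provider: ARO
organism_biocview: Punica_granatum
BSgenomeObjname: Pgranatum
circ_seqs: character(0)
seqnames: c("Pgranatum.1", "Pgranatum.2", "Pgranatum.3", "Pgranatum.4", "Pgranatum.5", "Pgranatum.6", "Pgranatum.7", "Pgranatum.8", "Pgranatum.9")
seqs_srcdir: /home/alex/work/genomepackage/seqs_srcdir
```

The changes from the original seed file are the four-part package name, the genome field replacing provider_version, a provider value that matches the package name, and an explicit circ_seqs field.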

Finally, note that using a 2bit file for the genomic sequences is preferred over using a collection of FASTA files. Converting from FASTA to 2bit is easy to do in R: use Biostrings::readDNAStringSet() to import the FASTA file(s) as a DNAStringSet object and export that object with rtracklayer::export.2bit().
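The FASTA to 2bit conversion described above could be sketched roughly as follows. The helper name fasta_to_2bit and the output file name PGv1.2bit are hypothetical, and the code assumes each <seqname>.fa file holds exactly one sequence. As far as I know, the 2bit format does not accept IUPAC ambiguity letters other than N, so the export may fail if the assembly contains them.

```r
# Sketch: merge per-sequence FASTA files into a single 2bit file.
# Needs the Bioconductor packages Biostrings and rtracklayer at call time.
# fasta_to_2bit is a hypothetical helper, not part of any package.
fasta_to_2bit <- function(seqnames, srcdir, out_2bit) {
  fasta_files <- file.path(srcdir, paste0(seqnames, ".fa"))
  missing_files <- fasta_files[!file.exists(fasta_files)]
  if (length(missing_files) > 0L)
    stop("missing FASTA file(s): ", paste(missing_files, collapse = ", "))

  # Import all FASTA files as one DNAStringSet.
  dna <- Biostrings::readDNAStringSet(fasta_files)

  # Assumption: one record per file. FASTA headers may carry extra
  # description text, so reset the names to the intended seqnames.
  stopifnot(length(dna) == length(seqnames))
  names(dna) <- seqnames

  # Write the sequences to a single 2bit file.
  rtracklayer::export.2bit(dna, out_2bit)
  invisible(out_2bit)
}

# Usage with the paths from this thread (runs only if the folder exists).
srcdir <- "/home/alex/work/genomepackage/seqs_srcdir"
if (dir.exists(srcdir)) {
  fasta_to_2bit(paste0("Pgranatum.", 1:9), srcdir,
                file.path(srcdir, "PGv1.2bit"))  # hypothetical file name
}
```

With a 2bit file in place, the seed file would point at it through the seqfile_name field (for example seqfile_name: PGv1.2bit) rather than relying on one FASTA file per sequence; check the BSgenome forging documentation for the exact fields your version expects.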

Hope this helps, H.

alslonik commented 1 year ago

It helps a lot! Thanks :)

hpages commented 1 year ago

Glad it helped.

Were you able to forge, install, load, and use the forged package? If so then feel free to close this issue.

Thanks

alslonik commented 1 year ago

Yes, forged, installed and used. Issue is closed, thank you very much.

Bioconductor / BSgenome, issue #59