Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
8 stars 8 forks source link

New BSgenome - Mycobacterium tuberculosis #19

Closed Ellauyz closed 2 years ago

Ellauyz commented 2 years ago

Good afternoon everyone

I am trying to run the BBCAnalyzer package in order to analyze data derived from whole genome sequencing of Mycobacterium tuberculosis (https://www.bioconductor.org/packages/release/bioc/html/BBCAnalyzer.html). This package requires a BSGenome object defining the reference genome used for comparisons.

I couldn't find a BSgenome package for M. tuberculosis, so I looked through the instructions on how to forge a BSgenome package myself, but I couldn't figure out how to do it (newbie at RStudio).

The genome I'm interested in is Mycobacterium tuberculosis H37Rv (ASM19595v2), available here https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2 This specific genome is one of the reference genomes for M. tuberculosis, as well as one of the main strains used in research, so I think it would be good to have it available through BSgenomes.

Could anyone help? It would be fantastic if someone was willing to build the corresponding BSgenome.

Thank you in advance! Best regards,

Ellauyz

LiNk-NY commented 2 years ago

Hi Ellauyz, @Ellauyz Thank you for the request. Hervé @hpages knows if we can support this organism. For the time being, I have produced the script that would be required to forge the package.

testfile <- normalizePath("./genome_assemblies_genome_fasta.tar")
wd <- dirname(testfile)
download.file("https://www.ncbi.nlm.nih.gov/projects/r_gencoll/ftp_service/nph-gc-ftp-service.cgi/?HistoryId=MCID_61391356369bba150d3809be&QueryKey=1&ReleaseType=RefSeq&FileType=GENOME_FASTA&Flat=true", testfile)
untar(testfile)
fastazip <- list.files(wd, recursive = TRUE, pattern = "fna.gz$")
library(BSgenome)
fasta.seqlengths(fastazip)

mtb_meta <- data.frame(
    Package = "BSgenome.Mtuberculosis.NCBI.ASM19595v2",
    Title = "Full genome sequences for Mycobacterium tuberculosis H37Rv (high GC Gram+)",
    Description = "Full genome sequences for Mycobacterium tuberculosis as provided by NCBI (ASM19595v2, 2013-02-01)",
    Version = "1.0.0",
    Author = "Sanger Institute",
    Maintainer = "Bioconductor Maintainer <maintainer@bioconductor.org>",
    License = "Artistic 2.0",
    organism = "Mycobacterium tuberculosis",
    genome = "ASM19595v2",
    provider = "NCBI",
    release_date = "2013-02-01",
    source_url = "https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2",
    organism_biocview = "mycobacterium_tuberculosis",
    BSgenomeObjname = "Mtuberculosis",
    SrcDataFiles = "GCF_000195955.2_ASM19595v2_genomic.fna from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2",
    seqs_srcdir = "/home/mr148/ASM19595v2/ncbi-genomes-2021-09-08/GCF_000195955.2_ASM19595v2_genomic.fna.gz"
)

write.dcf(mtb_meta, file = "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed")

forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"))
Ellauyz commented 2 years ago

Thank you so much Marcel @LiNk-NY for the answer!

I tried to run the script provided, and ran into some issues. I've solved some of them, but for now I can't get the forgeBSgenomeDataPkg to work.

I've tried to check how I could correct this, based on answers from https://www.biostars.org/p/236492/ I changed some fields of the mtb_meta data.frame (see below, mtb_meta version 2). When I use this version of the mtb_meta data.frame, I no longer get the error code for the genome slots, but a new error instead:

> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed")) Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, : values for symbols RELEASENAME are not single strings

I added some changes on the mtb_meta data.frame command, based on the vignette for BSgenomeForge (see below mtb_meta version 3). This leads to a different error again:

> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed")) Creating package in ./BSgenome.Mtuberculosis.NCBI.ASM19595v2 Error in .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir, : 'seqnames' must be a character vector In addition: Warning message: In forgeSeqFiles(x@provider, x@provider_version, .seqnames, mseqnames = .mseqnames, : 'seqnames' is empty

I'm not sure where to go from here, or if the changes I made to the mtb_meta data.frame are introducing mistakes of my own. Any suggestions on what could be happening?

Thank you so much again for your help, I really appreciate it!

Ellauyz


mtb_meta version 2 mtb_meta <- data.frame( Package = "BSgenome.Mtuberculosis.NCBI.ASM19595v2", Title = "Full genome sequences for Mycobacterium tuberculosis H37Rv (high GC Gram+)", Description = "Full genome sequences for Mycobacterium tuberculosis as provided by NCBI (ASM19595v2, 2013-02-01)", Version = "1.0.0", Author = "Sanger Institute", Maintainer = "Bioconductor Maintainer <maintainer@bioconductor.org>", License = "Artistic 2.0", organism = "Mycobacterium tuberculosis", common_name = "Mycobacterium tuberculosis", provider = "NCBI", provider_version = "ASM19595v2", release_date = "2013-02-01", source_url = "https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2", organism_biocview = "mycobacterium_tuberculosis", BSgenomeObjname = "Mtuberculosis", SrcDataFiles = "GCF_000195955.2_ASM19595v2_genomic.fna from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2", seqs_srcdir = "/home/mr148/ASM19595v2/ncbi-genomes-2021-09-08/GCF_000195955.2_ASM19595v2_genomic.fna.gz" )


mtb_meta version 3 mtb_meta <- data.frame( Package = "BSgenome.Mtuberculosis.NCBI.ASM19595v2", Title = "Full genome sequences for Mycobacterium tuberculosis H37Rv (high GC Gram+)", Description = "Full genome sequences for Mycobacterium tuberculosis as provided by NCBI (ASM19595v2, 2013-02-01)", Version = "1.0.0", Author = "Sanger Institute", Maintainer = "Bioconductor Maintainer <maintainer@bioconductor.org>", License = "Artistic 2.0", organism = "Mycobacterium tuberculosis", common_name = "Mycobacterium tuberculosis", provider = "NCBI", provider_version = "ASM19595v2", release_date = "2013-02-01", release_name = "ASM19595v2", source_url = "https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2", organism_biocview = "mycobacterium_tuberculosis", BSgenomeObjname = "Mtuberculosis", SrcDataFiles = "GCF_000195955.2_ASM19595v2_genomic.fna from https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2", seqs_srcdir = "/home/mr148/ASM19595v2/ncbi-genomes-2021-09-08/GCF_000195955.2_ASM19595v2_genomic.fna.gz" )

hpages commented 2 years ago

@Ellauyz I think there is a misunderstanding about the expectations. We do not forge every single BSgenome data package that people need. There's a vignette that explains how to do this ("How to forge a BSgenome data package", it's in the BSgenome package) in case you need a BSgenome data package that is not available in Bioconductor. Please consult the vignette and ask a specific question in case the vignette is not clear or things are not working as expected.

Here is some information that will hopefully get you started:

The NCBI landing page for this assembly is https://www.ncbi.nlm.nih.gov/assembly/GCF_000195955.2/

The assembly has only one sequence (called ANONYMOUS in the assembly report) that you can download from here (GCF_000195955.2_ASM19595v2_genomic.fna.gz file) or from here (GCA_000195955.2_ASM19595v2_genomic.fna.gz file). The 2 files are compressed FASTA files with a single FASTA record in them. They're identical except for the name of the sequence: they use the RefSeq accession (NC_000962.3) in the former and the GenBank accession (AL123456.3) in the latter. If you are curious, you can load these files in an R session with readDNAStringSet() from Biostrings. The function will return the sequences in a DNAStringSet object of length 1. Subsetting this object with the double-bracket operator (e.g. x[[1]]) will extract the first and unique sequence as a DNAString object.

As explained in the "How to forge a BSgenome data package" vignette, you can control how the sequence is going to be named in the BSgenome data package. For this you just need to rename the FASTA file according to the name you choose for the sequence (e.g. NC_000962.3.fna.gz) and add the following lines to your seed file:

seqnames: "NC_000962.3"
circ_seqs: character(0)
seqfiles_suffix: .fna.gz

So your seed file should look something like this:

Package: BSgenome.Mtuberculosis.NCBI.ASM19595v2
Title: Genome sequence for Mycobacterium tuberculosis H37Rv (high GC Gram+), assembly ASM19595v2
Description: Genome sequence for Mycobacterium tuberculosis H37Rv (high GC Gram+) as provided by NCBI (assembly ASM19595v2).
Version: 1.0.0
organism: Mycobacterium tuberculosis
provider: NCBI
genome: ASM19595v2
release_date: 2013/02/01
source_url: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/
organism_biocview: Mycobacterium_tuberculosis
BSgenomeObjname: Mtuberculosis
SrcDataFiles: GCF_000195955.2_ASM19595v2_genomic.fna.gz, downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/ on September 9, 2021, and renamed to NC_000962.3.fna.gz
PkgExamples: genome$NC_000962.3  # same as genome[["NC_000962.3"]]
seqnames: "NC_000962.3"
circ_seqs: character(0)
seqfiles_suffix: .fna.gz
seqs_srcdir: /local/path/to/the/directory/where/the/FASTA/file/is/located

Hope this helps,

H.

hpages commented 2 years ago

@Ellauyz Were you able to forge BSgenome.Mtuberculosis.NCBI.ASM19595v2? I'll close this issue next week if I don't hear back from you.

Ellauyz commented 2 years ago

No, I haven't been found a way to solve the seqname issues. I appreciate the help received, it definitively helped getting me started on how to do it myself! You can close this issue and I will work on this on my end. Thank you! Ellauyz

hpages commented 2 years ago

Good luck. Don't hesitate to reopen or to create a new issue if you need more help with this.

Alternatively, you can ask help on the Bioconductor support site: https://support.bioconductor.org/

H.

dpschreiner commented 11 months ago

Hello!

I have taken this up again based on the code above, and I also observe similar behavior. For one thing, it seems like forgeBSgenomeDataPkg expects the seqnames given in the seed to be already instantiated R objects. This is from BSgenomeForge.R:

        seqnames <- x@seqnames
        if (!is.na(seqnames)) {
            .seqnames <- eval(parse(text=seqnames))
        }

Thus i get the message, for example:

Error in eval(parse(text = seqnames)) : 
  object 'NC_000962.3' not found

I would instead expect the other information in the seed (seqs_srcdir, possibly seqfiles_suffix) to be used to get this information directly from the fasta on the file system.

If I create an object like this:

NC_000962.3 <- readDNAStringSet(file.path(dataDir, "NC_000962.3.fna"), format="fasta")

Then I proceed to a more generic issue regarding an already existing package destination dir:

> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"))
Creating package in ./BSgenome.Mtuberculosis.NCBI.ASM19595v2 
Error in Biobase::createPackage(x@Package, destdir, template_path, symvals) : 
  directory './BSgenome.Mtuberculosis.NCBI.ASM19595v2' exists; use unlink=TRUE to remove it, or choose another destination directory
> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"), unlink=TRUE)
Error in forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"),  : 
  unused argument (unlink = TRUE)
> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"), replace=TRUE)
Error in forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"),  : 
  unused argument (replace = TRUE)

e.g. neither unlink nor replace are accepted by this function as parameters.

If I manually delete the destination and proceed, I find that functions called from ForgeSeqFiles want .seqnames as a character vector:

> forgeBSgenomeDataPkg(file.path(wd, "BSgenome.Mtuberculosis.NCBI.ASM19595v2-seed"))
Creating package in ./BSgenome.Mtuberculosis.NCBI.ASM19595v2 
Error in .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir,  : 
  'seqnames' must be a character vector

Could it be that "seqnames" rather than ".seqnames" should be passed to ForgeSeqFiles?

And/or that another method rather than eval(parse()) be used in forgeBSgenomeDataPkg for the case when explicitly naming seqnames?

hpages commented 11 months ago

Hi,

Error in .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir, : 'seqnames' must be a character vector

Did you read the "How to forge a BSgenome data package" vignette from the BSgenome package? How do you specify seqnames in your seed file? Should be:

seqnames: "AL123456.3"

Also it's highly recommended to convert the FASTA file downloaded from NCBI to the 2bit format, and to rename the sequences in it from their RefSeq or GenBank names to their canonical names.

Anyways, if that sounds too complicated to you, the good news is that the new BSgenomeForge package makes all this a lot easier:

library(BSgenomeForge)

forgeBSgenomeDataPkgFromNCBI("GCF_000195955.2",
             pkg_maintainer="Jane Doe <janedoe@gmail.com>",
             organism="Mycobacterium tuberculosis H37Rv",
             circ_seqs=character(0))

This will create the BSgenome.Mh37rv.NCBI.ASM19595v2 package in the current directory :smiley:

If you don't like the package name and would rather use BSgenome.Mtuberculosis.NCBI.ASM19595v2 instead, then please edit the DESCRIPTION, R/zzz.R, and man/package.Rd files in the ./BSgenome.Mh37rv.NCBI.ASM19595v2/ folder.

See the How to use the BSgenomeForge package to forge a BSgenome data package vignette for more information.

Cheers, H.

dpschreiner commented 11 months ago

yes, i did have success once i pre-created the 2bit file. thanks for the tips!