Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

Is there any way to circumvent the need for 2bit? #28

Closed stevenjblair closed 2 years ago

stevenjblair commented 2 years ago

Hello again, I was pretty happy with myself this morning when forgeBSgenomeDataPkg() ran with no errors and I watched as the genome all got loaded into memory. A folder for the output was made and I had a mug of victory coffee!

Unfortunately, once the dust cleared and the loading finished I got an index overflow error from TwoBits_export. I specifically went with the fasta method because my genome is about 27 gb. I didn't realize that the forgeBSgenomeDataPkg() pipeline itself just converts the files into a 2bit? So I ended up with the following error:

Error in .TwoBits_export(mapply(.DNAString_to_twoBit, object, seqnames), : UCSC library operation failed In addition: Warning message: In .TwoBits_export(mapply(.DNAString_to_twoBit, object, seqnames), : Error in faToTwoBit, index overflow at chr7. The 2bit format does not support indexes larger than 4Gb, please split up into smaller files.

My seed

Package: BSgenome.Amexicanum.NCBI.ambMex60DD Title: Ambystoma Mexicanum (Axolotl) full genome (Schloissnig version V6.0-DD) Description: A chromosome-scale assembly of the axolotl genome as provided by Schloissnig (v6.0-DD, April. 2021) Version: 1.0.0 organism: Ambystoma mexicanum common_name: axolotl genome: GCA_002915635.3 provider: NCBI release_date: April, 2021 BSgenomeObjname: Amexicanum seqnames: paste(seqname$X1[1:dim(seqname)[1]]) circ_seqs: character(0) source_url: https://www.ncbi.nlm.nih.gov/assembly/GCA_002915635.3 organism_biocview: Ambystoma_mexicanum seqs_srcdir: ./GCA_002915635.3

end seed

So my questions: Is there a way to circumvent the need for 2bit? Or, at least chunk the genome into smaller pieces temporarily?

hpages commented 2 years ago

Yes forgeBSgenomeDataPkg() converts to 2bit by default. This is controlled via the ondisk_seq_format field in your seed file, which is considered to be set to 2bit when not specified. I would suggest that you set it to rda instead. See the "How to forge a BSgenome data package" vignette for more information.

Hope this helps.

H.

hpages commented 2 years ago

Hi @stevenjblair ,

Don't know if you were successful. I gave this a try today, and found that there are indeed some difficulties.

  1. When the input to the forging process is FASTA and not 2bit, the giant FASTA file provided by NCBI needs to be split in smaller files (one per sequence), as explained in the vignette.

  2. The sequence names stored in the giant FASTA file provided by NCBI (GCA_002915635.3_AmbMex60DD_genomic.fna.gz) are GenBank accession numbers. This is generally not what we want to use in a BSgenome data package, so we need to rename the sequences with the names from the full sequence report here. Here's the script I used to split the FASTA file and rename the sequences:

    ### Download GCA_002915635.3_AmbMex60DD_genomic.fna.gz from
    ### https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/915/635/GCA_002915635.3_AmbMex60DD/
    
    library(GenomeInfoDb)
    library(Biostrings)
    library(rtracklayer)
    
    ### Load the sequences (takes about 5 min):
    dna <- readDNAStringSet("GCA_002915635.3_AmbMex60DD_genomic.fna.gz")  # takes about 5 min
    
    ### Check the names on the sequences (GenBank accessions):
    current_GenBankAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))
    chrominfo <- getChromInfoFromNCBI("GCA_002915635.3")
    expected_GenBankAccn <- chrominfo[ , "GenBankAccn"]
    stopifnot(setequal(expected_GenBankAccn, current_GenBankAccn))
    
    ### Reorder and rename the sequences:
    dna <- dna[match(expected_GenBankAccn, current_GenBankAccn)]
    names(dna) <- chrominfo[ , "SequenceName"]
    
    ### Check the sequence lengths:
    stopifnot(identical(lengths(dna), chrominfo[ , "SequenceLength"]))
    
    ### Write each sequence to a FASTA file:
    for (i in seq_along(dna)) {
        filename <- paste0(names(dna)[[i]], ".fna")
        cat("saving ", filename, " ... ", sep="")
        writeXStringSet(dna[i], filename)
        cat("ok\n")
    }
    
    cat("DONE.")
  3. When the input to the forging process is one FASTA file per chromosome, the seed file must have the seqnames and circ_seqs fields. The seqnames field must be an R expression that returns a character vector with all the sequence names. Problem is that the AmbMex60DD assembly contains many sequences: 27157 in total! One possibility for the seqnames field is:

    seqnames: GenomeInfoDb::getChromInfoFromNCBI("GCA_002915635.3")$SequenceName

    However, doing this has some drawbacks that I'm not going to explain here. Another way to go is to include only the 28 axolotl chromosomes in the BSgenome data package to forge, and to exclude all the scaffolds (the AmbMex60DD assembly contains 27129 scaffolds). In this case the seqnames field is:

    seqnnames: paste0("chr", rep(1:14, each=2), c("p", "q"))
  4. Finally, I was able to forge BSgenome.Amexicanum.NCBI.ambMex60DD using the following seed file (BSgenome.Amexicanum.NCBI.ambMex60DD-seed):

    Package: BSgenome.Amexicanum.NCBI.ambMex60DD
    Title: Full genome sequences for Ambystoma mexicanum (AmbMex60DD)
    Description: Full genome sequences for Ambystoma mexicanum (Axolotl) as provided by NCBI (assembly AmbMex60DD, assembly accession GCA_002915635.3) and stored in Biostrings objects.
    Version: 1.0.0
    organism: Ambystoma mexicanum
    common_name: Axolotl
    genome: AmbMex60DD
    provider: NCBI
    release_date: 2021/04/01
    source_url: https://www.ncbi.nlm.nih.gov/assembly/GCA_002915635.3
    organism_biocview: Ambystoma_mexicanum
    BSgenomeObjname: Amexicanum
    seqnames: paste0("chr", rep(1:14, each=2), c("p", "q"))
    circ_seqs: character(0)
    SrcDataFiles: GCA_002915635.3_AmbMex60DD_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/915/635/GCA_002915635.3_AmbMex60DD/
    PkgExamples: genome$chr1p  # same as genome[["chr1p"]]
    seqs_srcdir: path/to/the/folder/with/small/fasta/files
    seqfiles_suffix: .fna
    ondisk_seq_format: rds

    Note that I'm setting ondisk_seq_format to rds instead of rda. The two on disk storage formats are very similar but using rds is actually preferred.

It took about one hour for forgeBSgenomeDataPkg("path/to/BSgenome.Amexicanum.NCBI.ambMex60DD-seed") to complete on my machine. The resulting package source tarball is 7G!

Hope this helps, H.

stevenjblair commented 2 years ago

Great stuff H!

I had not returned to this yet. Was actually going to have another go at it. I had defeated that issue about the GenBank accession numbers using sed. I like your method much better, doing it all in R! Very nice indeed, thank you so much.

I need those other unplaced scaffolds unfortunately. They do cause issues while naming in the seed and I did try different ways to print them all but didn't have much luck. I will tinker around with it later. the rda, or rds in the seed, THIS is what will get me in the right place and I thank you so much. Steve

hpages commented 2 years ago

No problem. Also I just made some adjustments to the BSgenome software package (in version 1.63.5) so that using this kind of seqnames field

seqnames: GenomeInfoDb::getChromInfoFromNCBI("GCA_002915635.3")$SequenceName

should no longer cause problems.

Make sure you grab this version of BSgenome if you want to include all the scaffolds in BSgenome.Amexicanum.NCBI.ambMex60DD. I just did that and it took about 1 h 10 min to forge the package on my machine. Then building the package source tarball with R CMD build took about 10 more min. The size of the resulting tarball is 9.4G!

BSgenome 1.63.5 belongs to BioC 3.15 (current BioC devel) so you need R 4.2 (current R devel). Should become available via BiocManager::install() in the next 48 hours.

Cheers, H.

hpages commented 2 years ago

Maybe we can close this @stevenjblair ? -- H.