Closed by sopeadeniji 2 years ago
Hi @sopeadeniji,
Note that registering the NCBI assembly is not required. It's just that, for an unregistered genome assembly, you must specify the seqnames field in your seed file. In the case of the Catharus ustulatus assembly, you could set seqnames like this:
seqnames: getChromInfoFromNCBI("GCF_009819885.2")$SequenceName
Note that you will also need to specify the circular sequences:
circ_seqs: "bCatUst1_MT"
More precisely, if the genome assembly is registered in GenomeInfoDb, then you don't need to specify the seqnames field because, in this case, forgeBSgenomeDataPkg() will be able to fetch the sequence names for you. So the process of forging a BSgenome data package is just more convenient when the genome assembly is registered in GenomeInfoDb, but it's not a requirement.
Hope this helps, H.
I will add these lines to my seed file. How would I include the circular sequences in the fasta_to_sorted_2bit.R script used to convert the genome .fna.gz file to .2bit file? Thank you very much for your help.
How would I include the circular sequences in the fasta_to_sorted_2bit.R script used to convert the genome .fna.gz file to .2bit file?
I'm not sure what you mean. The circular sequences are FASTA sequences that are included in the genome .fna.gz file, like any other sequence. Why would fasta_to_sorted_2bit.R need to give them special treatment?
I am getting the error below when running the fasta_to_sorted_2bit.R script. I assume it is caused by non-nuclear DNA in the genome sequence.
stopifnot(setequal(expected_RefSeqAccn, current_RefSeqAccn))
Error: setequal(expected_RefSeqAccn, current_RefSeqAccn) is not TRUE
You would need to provide more information if you want us to be able to help, e.g. which genome file you are using (GCF_009819885.2_bCatUst1.pri.v2_genomic.fna.gz or GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz), which fasta_to_sorted_2bit.R script you are using (show us the script), and your sessionInfo(). Ideally you'd want to share all the information that others need to reproduce the problem.
That being said, one problem I see is that the bCatUst1_MT sequence has no RefSeq accession number (it's set to NA in the Full sequence report). But it does have a GenBank accession number (CM020378.1). So how about using the genome .fna.gz file that contains the GenBank accession numbers and adapting your fasta_to_sorted_2bit.R script to work with those accession numbers instead of the RefSeq accession numbers?
H.
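As a rough illustration of what "working with the accession numbers" means here: the script takes the first space-separated token of each FASTA header as the accession. The sketch below shows that step in base R on plain strings. CM020378.1 is the GenBank accession mentioned above; the second accession and both descriptions are made up for the example, and real headers may look different.

```r
# FASTA header lines without the leading ">" (descriptions are made up).
headers <- c(
  "CM020378.1 made-up description of the mitochondrial sequence",
  "toyAccn1.1 made-up description of another sequence"
)

# Split each header on single spaces and keep only the first token,
# which is where the accession number is assumed to sit.
accn <- vapply(strsplit(headers, " ", fixed = TRUE),
               function(tokens) tokens[1L],
               character(1))

print(accn)
```

If the headers of your file carry the accession somewhere else, this extraction step is the part of the script that would need adapting.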
So it looks like the major problem here is that GCF_009819885.2_bCatUst1.pri.v2_genomic.fna.gz is actually missing the bCatUst1_MT sequence: the file contains only 160 sequences out of the 161 sequences reported in the Full sequence report. So yeah, the only valid choice is to use GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz.
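When the setequal() check fails like this, one way to find the culprit is to compare the two accession vectors with setdiff() in both directions. The sketch below uses small made-up vectors in place of the real ones (the "NC_toy" accessions are placeholders, not real RefSeq accessions); the NA mimics a sequence that has no RefSeq accession in the report.

```r
# Toy stand-ins (made-up values) for the two vectors compared in the script.
expected_RefSeqAccn <- c("NC_toy1.1", "NC_toy2.1", NA)  # from the sequence report
current_RefSeqAccn  <- c("NC_toy2.1", "NC_toy1.1")      # parsed from the FASTA headers

# Entries the sequence report expects but the FASTA file does not provide.
missing_from_fasta <- setdiff(expected_RefSeqAccn, current_RefSeqAccn)

# Entries found in the FASTA file but not listed in the sequence report.
unexpected_in_fasta <- setdiff(current_RefSeqAccn, expected_RefSeqAccn)

print(missing_from_fasta)
print(unexpected_in_fasta)

# A difference in length is another quick hint that a sequence is missing.
print(length(expected_RefSeqAccn) - length(current_RefSeqAccn))
```

In the real script the same two setdiff() calls can be run on expected_RefSeqAccn and current_RefSeqAccn just before the stopifnot() line.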
I would suggest that you also use the GenBank assembly accession in your seed file:
seqnames: getChromInfoFromNCBI("GCA_009819885.2")$SequenceName
This way you completely stay away from any RefSeq resource.
Hope this helps.
Anyways, I've just registered the bCatUst1.pri.v2 assembly in GenomeInfoDb 1.34.4: see commit cf8d71c1a5c9f87052586e6ef031c609a4d87822
This new version will become available in BioC 3.16 via BiocManager::install() in the next couple of days or so. If you use this version you shouldn't need to specify seqnames or circ_seqs in your seed file anymore. However, please note that this new version won't help you with your fasta_to_sorted_2bit.R problem, as the process of converting from FASTA to 2bit is not affected by the assembly being registered or not.
Cheers, H.
Hi Herve, it worked! I used GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz for the genome sequence and the GenBank accession numbers in the fasta_to_sorted_2bit.R script, as suggested. Here's the adjusted fasta_to_sorted_2bit.R. Thanks a lot for your help.
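library(Biostrings)
# Load the GenBank (GCA) FASTA file, which includes the bCatUst1_MT sequence.
dna <- readDNAStringSet("GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz")
# The accession is the first space-separated token of each FASTA header.
current_GenBankAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))
library(GenomeInfoDb)
chrominfo <- getChromInfoFromNCBI("GCA_009819885.2")
expected_GenBankAccn <- chrominfo[ , "GenBankAccn"]
# Make sure the FASTA file and the NCBI sequence report list the same sequences.
stopifnot(setequal(expected_GenBankAccn, current_GenBankAccn))
# Reorder the sequences as in the report, then rename them to their sequence names.
dna <- dna[match(expected_GenBankAccn, current_GenBankAccn)]
names(dna) <- chrominfo[ , "SequenceName"]
library(rtracklayer)
export.2bit(dna, "bCatUst1-pri.v2.sorted.2bit")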
Best, Sope
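For anyone reading along, the check-reorder-rename part of that script can be tried out in base R on toy data. This is only a sketch: the accessions, sequence names, and sequences below are all made up, and a named character vector stands in for the DNAStringSet object.

```r
# 'current_accn' is the order found in the FASTA file, 'expected_accn' is the
# order listed in the sequence report (all values are made up).
current_accn  <- c("toyB.1", "toyC.1", "toyA.1")
expected_accn <- c("toyA.1", "toyB.1", "toyC.1")
seq_names     <- c("1", "2", "MT")  # hypothetical sequence names, in report order

# Toy sequences in FASTA order, named by accession.
dna <- c(toyB.1 = "ACGT", toyC.1 = "GG", toyA.1 = "TTTA")

# Same sanity check as in the script: both sides must list the same accessions.
stopifnot(setequal(expected_accn, current_accn))

# For each expected accession, find its position in the FASTA order.
idx <- match(expected_accn, current_accn)

# Reorder to report order, then replace the accessions with the sequence names.
dna_sorted <- unname(dna[idx])
names(dna_sorted) <- seq_names

print(idx)
print(dna_sorted)
```

The point of the match() call is that the resulting sequences end up in the same order as the rows of the sequence report, so assigning the SequenceName column as names lines up correctly.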
Great! Did you also manage to use the bCatUst1-pri.v2.sorted.2bit file to forge the BSgenome.Custulatus.NCBI.bCatUst1.pri.v2 package? In which case we can close this.
Yes, I forged the genome. Thanks.
Hi, I would like to have the genome for Catharus ustulatus registered for the purpose of forging a BSgenome package. The assembly is bCatUst1.pri.v2 and the link to the NCBI page is below:
https://www.ncbi.nlm.nih.gov/assembly/GCF_009819885.2/
Thanks