Registration request for Catharus ustulatus NCBI assembly

Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style

https://bioconductor.org/packages/GenomeInfoDb

31 stars 13 forks source link

Registration request for Catharus ustulatus NCBI assembly #78

Closed sopeadeniji closed 2 years ago

sopeadeniji commented 2 years ago

Hi, I would like to have the genome for Catharus ustulatus registered for the purpose of forging a BSgenome package. The assembly is bCatUst1.pri.v2 and the link to NCBI page is below:

https://www.ncbi.nlm.nih.gov/assembly/GCF_009819885.2/

Thanks

hpages commented 2 years ago

Hi @sopeadeniji,

Note that registering the NCBI assembly is not required. It's just that, for an unregistered genome assembly, you must specify the seqnames field in your seed file. In the case of the Catharus ustulatus assembly, you could set seqnames like this:

seqnames: getChromInfoFromNCBI("GCF_009819885.2")$SequenceName

Note that you will also need to specify the circular sequences:

circ_seqs: "bCatUst1_MT"

More precisely, if the genome assembly is registered in GenomeInfoDb, then you don't need to specify the seqnames field because, in this case, forgeBSgenomeDataPkg() will be able to fetch the sequence names for you. So the process of forging a BSgenome data package is just more convenient when the genome assembly is registered in GenomeInfoDb, but it's not a requirement.

Hope this helps, H.

sopeadeniji commented 2 years ago

I will add these lines to my seed file. How would I include the circular sequences in the fasta_to_sorted_2bit.R script used to convert the genome .fna.gz file to .2bit file? Thank you very much for your help.

hpages commented 2 years ago

How would I include the circular sequences in the fasta_to_sorted_2bit.R script used to convert the genome .fna.gz file to .2bit file?

I'm not sure what you mean. The circular sequences are FASTA sequences that are included in the genome .fna.gz file, like any other sequence. Why would fasta_to_sorted_2bit.R need to give them special treatment?

sopeadeniji commented 2 years ago

I am getting the error code below when running the fasta_to_sorted_2bit.R script. I assume it would be non-nuclear dna in the genome sequence.

stopifnot(setequal(expected_RefSeqAccn, current_RefSeqAccn)) Error: setequal(expected_RefSeqAccn, current_RefSeqAccn) is not TRUE

hpages commented 2 years ago

You would need to provide more information if you want us to be able to help e.g.

what genome .fna.gz file are you using (NCBI provides two, GCF_009819885.2_bCatUst1.pri.v2_genomic.fna.gz and GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz),
what fasta_to_sorted_2bit.R script are you using (show us the script),
your sessionInfo().

Ideally you'd want to share all the information that others need to reproduce the problem.

That being said, one problem I see is that the bCatUst1_MT sequence has no RefSeq accession number (it's set to NA in the Full sequence report). But it does have a GenBank accession number (CM020378.1). So how about using the genome .fna.gz file that contains the GenBank accession numbers and adapt your fasta_to_sorted_2bit.R script to work with those accession numbers instead of the RefSeq accession numbers?

hpages commented 2 years ago

So it looks like the major problem here is that GCF_009819885.2_bCatUst1.pri.v2_genomic.fna.gz is actually missing the bCatUst1_MT sequence: the file contains only 160 sequences out of the 161 sequences reported in the Full sequence report. So yeah, the only valid choice is to use GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz.

I would suggest that you also use the GenBank assembly accession in your seed file:

seqnames: getChromInfoFromNCBI("GCA_009819885.2")$SequenceName

This way you completely stay away from any RefSeq resource.

Hope this helps.

hpages commented 2 years ago

Anyways, I've just registered the bCatUst1.pri.v2 assembly in GenomeInfoDb 1.34.4: see commit cf8d71c1a5c9f87052586e6ef031c609a4d87822

This new version will become available in BioC 3.16 via BiocManager::install() in the next couple of days or so. If you use this version you shouldn't need to specify seqnames or circ_seqs in your seed file anymore. However please note that this new version won't help you with your fasta_to_sorted_2bit.R problem, as the process of converting from FASTA to 2bit is not affected by the assembly being registered or not.

Cheers, H.

sopeadeniji commented 2 years ago

Hi Herve, It worked! I used GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz for genome sequence and GenBank accession number in fasta_to_sorted_2bit.R script as suggested. Here's the adjusted fasta_to_sorted_2bit.R below. Thanks a lot for your help.

dna <- readDNAStringSet("GCA_009819885.2_bCatUst1.pri.v2_genomic.fna.gz")

Check seqnames.

current_GenBankAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L)) library(GenomeInfoDb chrominfo <- getChromInfoFromNCBI("GCA_009819885.2")

expected_GenBankAccn <- chrominfo[ , "GenBankAccn"] stopifnot(setequal(expected_GenBankAccn, current_GenBankAccn)) dna <- dna[match(expected_GenBankAccn, current_GenBankAccn)]

Rename sequences.

names(dna) <- chrominfo[ , "SequenceName"]

Export as 2bit.

library(rtracklayer) export.2bit(dna, "bCatUst1-pri.v2.sorted.2bit")

Best, Sope

hpages commented 2 years ago

Great! Did you also manage to use the bCatUst1-pri.v2.sorted.2bit file to forge the BSgenome.Custulatus.NCBI.bCatUst1.pri.v2 package? In which case we can close this.

sopeadeniji commented 2 years ago

Yes, I forged the genome. Thanks.