Ecogenomics / GTDBNCBI

The GTDB provides the software infrastructure for working with a large collection of genomic resources. The major goal of this initiative is to provide a phylogenetically consistent taxonomy for archaea and bacteria.
https://gtdb.ecogenomic.org/
GNU General Public License v3.0

GenBank genomes without annotations may be annotated in RefSeq #10

Closed: donovan-h-parks closed this issue 8 years ago

donovan-h-parks commented 9 years ago

There are GenBank assemblies (e.g., GCA_000215505.2_ASM21550v2) that do not have called genes. This occurs when users submit genomes without annotations. In at least some cases, NCBI has added these genomes to RefSeq and annotated them (e.g., GCF_000215505.1_ASM21550v2). It would be good to identify these cases and take the RefSeq genomes instead of the GenBank genomes.
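One way to spot such cases is sketched below. It is an illustration only, not the GTDB code. It assumes that a GenBank assembly and its RefSeq counterpart share the numeric part of the accession (as in the example above) while the version suffix may differ. The function names `assembly_key` and `prefer_refseq` are hypothetical.

```python
import re

# Matches names such as GCA_000215505.2_ASM21550v2 or GCF_000215505.1_ASM21550v2.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d+)\.(\d+)")


def assembly_key(name):
    """Return the numeric identifier assumed to be shared by a GCA/GCF pair."""
    match = ACCESSION_RE.match(name)
    if match is None:
        raise ValueError(f"not an assembly accession: {name!r}")
    return match.group(2)


def prefer_refseq(unannotated_genbank, refseq):
    """Map each unannotated GenBank assembly to its RefSeq counterpart.

    A GenBank assembly with no RefSeq counterpart maps to itself.
    """
    refseq_by_key = {assembly_key(name): name for name in refseq}
    return {
        name: refseq_by_key.get(assembly_key(name), name)
        for name in unannotated_genbank
    }


if __name__ == "__main__":
    chosen = prefer_refseq(
        ["GCA_000215505.2_ASM21550v2"],
        ["GCF_000215505.1_ASM21550v2"],
    )
    print(chosen)
```

Matching on the numeric identifier rather than the full name is deliberate here, because the two records in the example carry different version numbers (.2 and .1).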

Response from NCBI:

Submitters of genomic sequences (WGS) may or may not provide their own annotation and they can even request NCBI to annotate the sequences for them. Please see the first paragraph on the WGS page: https://www.ncbi.nlm.nih.gov/genbank/wgs

The Anaplasma marginale assembly that you are looking at is at the "contig" assembly level (nothing is assembled beyond the level of sequence contigs): http://www.ncbi.nlm.nih.gov/assembly/GCF_000215505.1/

The corresponding GenBank contigs have not been annotated. NCBI (RefSeq) took the GenBank sequence (61 contigs in this case) and annotated these through the NCBI Prokaryotic Genome Annotation Pipeline.

donovan-h-parks commented 8 years ago

We are now using RefSeq as the primary source of genomes and supplementing it with GenBank genomes as appropriate.
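A rough sketch of that selection order follows. The criteria behind "as appropriate" are not stated in this thread, so the rule used here (add a GenBank genome only when no RefSeq genome shares its numeric accession) is an assumption, and `merge_genome_sets` is a hypothetical name rather than a function from the GTDB code base.

```python
import re


def merge_genome_sets(refseq, genbank):
    """Return all RefSeq genomes plus GenBank genomes lacking a RefSeq record.

    Assumption: paired GCA/GCF records share the numeric part of the accession.
    """

    def key(name):
        match = re.match(r"^GC[AF]_(\d+)\.\d+", name)
        if match is None:
            raise ValueError(f"not an assembly accession: {name!r}")
        return match.group(1)

    covered = {key(name) for name in refseq}
    # Keep only GenBank genomes that RefSeq does not already cover.
    extras = [name for name in genbank if key(name) not in covered]
    return list(refseq) + extras
```

For example, given the RefSeq record GCF_000215505.1_ASM21550v2 and the GenBank record GCA_000215505.2_ASM21550v2, only the RefSeq record is kept.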