CSB5 / OPERA-MS

OPERA-MS - Hybrid Metagenomic Assembler

Database of 2,800 complete genomes #37

Closed: wangpeng407 closed this issue 4 years ago

wangpeng407 commented 4 years ago

Step 5 of OPERA-MS is "computation of Mash genomic distance against a database of 2,800 complete genomes".

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt | wc -l

In NCBI, there are 16,835 complete genomes. Even after removing viruses, animals, human, and other non-bacterial entries, there are still more than 2,800 complete bacterial genomes.
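
To put the count above in terms of species rather than assemblies, here is a minimal sketch. It assumes the usual NCBI `assembly_summary.txt` layout (tab-separated, header lines starting with `#`, `species_taxid` in column 7, `assembly_level` in column 12). The function name `count_complete` is hypothetical and is not part of OPERA-MS.

```shell
# Count complete-genome assemblies and the distinct species they cover.
# Assumed layout (NCBI assembly_summary.txt): tab-separated columns,
# column 7 = species_taxid, column 12 = assembly_level, '#' header lines.
# count_complete is a hypothetical helper name, not an OPERA-MS command.
count_complete() {
    awk -F '\t' '
        /^#/ { next }                      # skip header/comment lines
        $12 == "Complete Genome" {
            n++                            # one more complete assembly
            if (!($7 in seen)) {           # first time this species is seen
                seen[$7] = 1
                s++
            }
        }
        END { printf "%d assemblies, %d species\n", n + 0, s + 0 }
    ' "$1"
}
```

Running `count_complete assembly_summary.txt` on the downloaded file prints both numbers on one line. The species count is the better comparison against the size of a reference database, because many complete assemblies belong to the same species.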

So my question is: why did the authors choose those 2,800 genomes as the reference rather than all NCBI complete genomes?

Thanks ~

dbertran78 commented 4 years ago

Currently, the OPERA-MS genome database is based on an old version of the NCBI complete genomes that contained only 2,800 genomes. We will increase the number of species in the OPERA-MS database in our next release. The updated database will contain genomes from more than 20,000 species, and we will provide a script that lets users easily generate a custom reference genome database.

Best regards,

--- Denis