Ecogenomics / GTDBNCBI

The GTDB provides the software infrastructure for working with a large collection of genomic resources. The major goal of this initiative is to provide a phylogenetically consistent taxonomy for archaea and bacteria.
https://gtdb.ecogenomic.org/
GNU General Public License v3.0

Need gtdb_cluster_size and gtdb_clustered_genomes fields in metadata_view table #27

Closed donovan-h-parks closed 8 years ago

donovan-h-parks commented 8 years ago

There are two additional columns that would be good to have in the metadata_view table. These fields relate to representative genomes and I am currently calculating them "on-the-fly" for the ARB output file. The two fields are:

1) gtdb_cluster_size: the number of genomes in a cluster. This should be 0 if a genome is not a representative, and equal to the size of the cluster if the genome is a representative. Please note that I consider a representative genome to be a member of its own cluster. That is, if a representative genome clustered with only one other genome, the gtdb_cluster_size field is 2. Similarly, if a representative did not cluster with any genomes, it still has a size of 1.

2) gtdb_clustered_genomes: this is simply a comma-separated list of the genomes in a cluster. If a genome is not a representative, the field should be None. For representative genomes, it is a list of all genomes in the cluster (including the representative itself!).
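
To make the two definitions above concrete, here is a minimal Python sketch of how the fields could be derived. It is not the TreeManager code or the eventual database implementation. The function name `cluster_fields`, its arguments, and the genome IDs are all hypothetical, and it assumes the clustering is available as a mapping from each clustered genome to its representative.

```python
def cluster_fields(genome_ids, representatives, rep_of):
    """Compute (gtdb_cluster_size, gtdb_clustered_genomes) for each genome.

    Hypothetical inputs, assumed for illustration only:
      genome_ids      -- iterable of all genome IDs
      representatives -- set of genome IDs that are representatives
      rep_of          -- dict mapping a clustered genome ID to its
                         representative's ID (every value is assumed
                         to be a member of `representatives`)
    """
    # Every representative starts in its own cluster, so an unclustered
    # representative still ends up with a cluster size of 1.
    members = {rep: [rep] for rep in representatives}
    for genome, rep in rep_of.items():
        if genome != rep:
            members[rep].append(genome)

    fields = {}
    for genome in genome_ids:
        if genome in members:
            cluster = members[genome]
            fields[genome] = (len(cluster), ",".join(cluster))
        else:
            # Non-representatives get a size of 0 and no genome list.
            fields[genome] = (0, None)
    return fields


# Made-up genome IDs: G1 represents G2, and G3 is a representative
# that did not cluster with any other genome.
example = cluster_fields(
    genome_ids=["G1", "G2", "G3"],
    representatives={"G1", "G3"},
    rep_of={"G2": "G1"},
)
print(example["G1"])  # (2, 'G1,G2')
print(example["G2"])  # (0, None)
print(example["G3"])  # (1, 'G3')
```

Seeding each cluster with its own representative is what gives the "size of 1" and "including the representative itself" behaviour described above without any special-casing.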

These are currently being calculated by the TreeManager. See the SQL query around line 207 and the final assignment to these fields on lines 240 and 241. It would be far better if these fields were in the metadata_view table so they didn't need to be calculated "on-the-fly", but more importantly so they would appear in the metadata table produced by ">gtdb metadata export".

donovan-h-parks commented 8 years ago

Implemented by Pierre.