Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Mixed bacterial colonies classified as 'Archaea'? #184

Closed justinmaire closed 4 years ago

justinmaire commented 5 years ago

Hi!

Not really a problem here, just a curiosity question: I ran GTDB-Tk on 50 bacterial genomes, and a few of them were returned as archaeal genomes. That was odd, because those bacteria had previously been characterized with bacterial 16S primers, so I was quite sure they were bacteria and not archaea. Looking more closely, those genomes turned out to be abnormally large (8-12 Mb), so I put them through a metagenome analysis tool (MG-RAST), which revealed, as expected, that those specific genomes were mixed colonies (generally two or three different species), but bacterial species nonetheless. Any idea why they were classified as archaea? Did GTDB-Tk go haywire because it found every marker in duplicate or triplicate?

Thanks! Justin

pchaumeil commented 5 years ago

Hello, those do look like very large genomes indeed. As a first sanity check, I would recommend running CheckM (https://ecogenomics.github.io/CheckM/) on your dataset to estimate the completeness and contamination of your assemblies. If those genomes are highly contaminated, it will affect the GTDB-Tk classification.

aaronmussig commented 5 years ago

Hi Justin, if possible, could you provide the genomes? I'm keen to explore how GTDB-Tk behaves with them. Thanks!

justinmaire commented 5 years ago

Thanks for your answers! I did run CheckM and, as expected, they're contaminated. Just a quick question on that: which threshold do you use for the contamination score to confidently say 'this genome is contaminated'? My contaminated ones are all above 100 and their detailed scores clearly show contamination, while all the other ones are between 0 and 2. However, I have one genome that sits at 18 and has a few markers showing up twice, so I was wondering what your thoughts are on those intermediate scores?

Aaron, yes, I'd be more than happy to provide those genomes. I've got 8, but I'm not sure what the easiest way to do this is (technologically challenged person here!).

pchaumeil commented 5 years ago

Hello Justin, currently we recommend running GTDB-Tk on genomes estimated to be ≥50% complete with ≤10% contamination, consistent with community standards (https://www.nature.com/articles/nbt.3893).
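
For anyone who wants to apply those two cut-offs to a whole batch, here is a minimal Python sketch. It assumes a tab-separated quality table with columns named "Bin Id", "Completeness" and "Contamination" (an assumption about how the CheckM results were exported; adjust the names to match your file). The function name `split_by_quality` and the example values are hypothetical and are not data from the genomes discussed in this thread.

```python
import csv
import io

# Thresholds quoted in the reply above: >=50% complete, <=10% contamination.
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def split_by_quality(tsv_text,
                     min_completeness=MIN_COMPLETENESS,
                     max_contamination=MAX_CONTAMINATION):
    """Return (passing, failing) lists of genome IDs from a tab-separated table.

    Column names are assumed, not guaranteed, to match your CheckM export.
    """
    passing, failing = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness >= min_completeness and contamination <= max_contamination:
            passing.append(row["Bin Id"])
        else:
            failing.append(row["Bin Id"])
    return passing, failing


# Made-up illustrative values only.
EXAMPLE_TSV = (
    "Bin Id\tCompleteness\tContamination\n"
    "genome_A\t99.1\t1.2\n"
    "genome_B\t98.7\t18.0\n"
    "genome_C\t100.0\t120.5\n"
    "genome_D\t42.0\t0.5\n"
)

if __name__ == "__main__":
    passing, failing = split_by_quality(EXAMPLE_TSV)
    print("pass:", passing)
    print("fail:", failing)
```

Under these thresholds, a genome with a contamination estimate of 18 falls outside the recommended range, as do the ones above 100.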

pchaumeil commented 4 years ago

Ticket closed due to inactivity.