EBI-Metagenomics / EukCC

Tool to estimate genome quality of microbial eukaryotes
GNU General Public License v3.0

max_silent_contamination ? #19

Closed michoug closed 3 years ago

michoug commented 3 years ago

Hi, I don't understand what max_silent_contamination means in the eukcc.tsv file. For example, for one MAG I get a contamination of 0 and a max_silent_contamination of 96.03. Best, Greg

openpaul commented 3 years ago

Hello, sorry for the delay, I was off for a couple of days.

This feature is not yet fully documented, so it's good that you ask.

If you use EukCC to also predict proteins (using GeneMark-ES), EukCC can associate the marker genes it finds with contig names. It can thus estimate how many contigs did not contribute a single-copy marker gene (SCMG) to the estimated completeness/contamination score.

So if all SCMGs are located on only 50% of the contigs, the remaining 50% of the contigs could, in the worst case, be contamination from a foreign genome and we would not know.

With fewer contigs, the max silent contamination goes down, which rewards higher-quality assemblies with a more confident score.

I would not worry too much about it. It makes the uncertainty visible, but that uncertainty was always there. A max silent contamination of 96% means that only 4% of your MAG's DNA contributed to computing the completeness/contamination score. That's not uncommon for short-read assembled metagenomes.
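To make the arithmetic concrete, here is a minimal sketch of the idea in Python. It is not EukCC's actual code: the function name, the inputs, and the choice to weight contigs by their length in base pairs are all assumptions made for illustration, and the contig names and lengths are made up.

```python
def max_silent_contamination(contig_lengths, marker_contigs):
    """Percent of a bin's bases on contigs that carry no marker gene.

    contig_lengths: dict mapping contig name -> length in bp (hypothetical input)
    marker_contigs: set of contig names with at least one SCMG hit (hypothetical input)
    """
    total = sum(contig_lengths.values())
    if total == 0:
        return 0.0
    # Bases on contigs that did not contribute any marker gene to the score.
    silent = sum(
        length
        for name, length in contig_lengths.items()
        if name not in marker_contigs
    )
    return 100.0 * silent / total


# Made-up bin: only one small contig carries marker genes.
contig_lengths = {"contig_1": 40_000, "contig_2": 500_000, "contig_3": 460_000}
marker_contigs = {"contig_1"}
print(max_silent_contamination(contig_lengths, marker_contigs))  # 96.0
```

In this toy bin, 4% of the bases sit on a contig with marker genes, so the remaining 96% cannot be checked and is reported as the upper bound.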

Hope that explains it.