Why do two contaminated marker genes lead to 33% heterogeneity?

Xiaojun928 commented 4 years ago

Hi, I'm using CheckM to estimate the genomic features of 20 isolates. However, one genome exhibits 0.64% contamination (2 duplicated marker genes) and 33% strain heterogeneity. I wonder why the estimated heterogeneity is not 25% or 50%. Besides, since I'm working on a population genomic analysis at the strain level, is this genome suitable for the downstream analysis? As far as I can tell, such an observation may result from a recent gene duplication, a contaminated bacterial colony, or some other issue during heterogeneity estimation. Is it common to find such heterogeneity in a pure-culture bacterial isolate?

Thanks and best wishes, Xiaoyuan

donovan-h-parks commented 4 years ago

Hi. Are the duplicated marker genes only duplicated once, or is there a duplicate gene that is present 3 or more times? If you post the entire CheckM output line for this genome, I can better answer your question. CheckM identifies genes that are typically single copy. As you indicated, it is possible that in some instances a gene has been legitimately duplicated.

Xiaojun928 commented 4 years ago

Hi,

Many thanks for your reply. I found that two genes are duplicated, each only once. I have attached the CheckM output.

Thanks!

GNM3519_gene_presence.txt GNM3519_checkm_feature.txt

donovan-h-parks commented 4 years ago

Hi. I am unclear why the strain heterogeneity is 33%. If you can send me the genome, I'll debug the code to see if there is an error in the mix.

Xiaojun928 commented 4 years ago

It's very kind of you to help! Please see the attached genome and marker genes (this list was modified according to a closed complete genome that is very close to GNM3519). I am using CheckM v1.1.2.

Thanks and regards

full.ms.txt GNM3519.fasta.txt

donovan-h-parks commented 4 years ago

Hi,

I ran your genome through CheckM v1.1.2 with checkm lineage_wf . ./checkm -x fasta. The results indicate 3 duplicated genes:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id            Marker lineage          # genomes   # markers   # marker sets   0    1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
  GNM3519   f__Rhodobacteraceae (UID3360)       56         582           313        3   576   3   0   0   0       99.25            0.68              33.33
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

What version of CheckM are you using? What command are you running? The results are extremely similar, but it would be nice to know why we have a discrepancy. In particular, both our results indicate the same marker set (f__Rhodobacteraceae; UID3360), but my results indicate this set has 582 markers while your results suggest only 579 markers.

Cheers, Donovan

Xiaojun928 commented 4 years ago

Hi Donovan,

I'm sorry I did not make this clear in my last message. I ran CheckM v1.1.2 with the following command:

checkm lineage_wf -x fasta input_dir output_dir

and it returns the same result as yours:

Bin Id	Marker lineage	# genomes	# markers	# marker sets	Completeness	Contamination	Strain heterogeneity	Genome size (bp)	# ambiguous bases	# scaffolds	# contigs	N50 (scaffolds)	N50 (contigs)	Mean scaffold length (bp)	Mean contig length (bp)	Longest scaffold (bp)	Longest contig (bp)	GC	GC std (scaffolds > 1kbp)	Coding density	Translation table	# predicted genes	0	1	2	3	4	5+
GNM3519	f__Rhodobacteraceae (UID3360)	56	582	313	99.25	0.68	33.33	4472332	0	30	30	487418	487418	149077	149077	804436	804436	56.6	3.23	91.03	11	4399	3	576	3	0	0	0

I then removed three marker genes from the estimation, because they are not present as single-copy genes in a closely related complete genome:

PF13603.1
PF02616.9
PF08529.6

Using the modified .ms file:

checkm qa --tab_table -f result.txt -o 2 modified.ms output_dir

the result is:

Bin Id	Marker lineage	# genomes	# markers	# marker sets	Completeness	Contamination	Strain heterogeneity	Genome size (bp)	# ambiguous bases	# scaffolds	# contigs	N50 (scaffolds)	N50 (contigs)	Mean scaffold length (bp)	Mean contig length (bp)	Longest scaffold (bp)	Longest contig (bp)	GC	GC std (scaffolds > 1kbp)	Coding density	Translation table	# predicted genes	0	1	2	3	4	5+
GNM3519	f__Rhodobacteraceae (UID3360)	56	579	312	99.68	0.64	33.33	4472332	0	30	30	487418	487418	149077	149077	804436	804436	56.6	3.23	91.03	11	4399	1	576	2	0	0	0
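For comparing the two runs, the marker copy-number columns can be pulled out of the tab-separated table programmatically. The following is a minimal sketch, assuming the file was written with --tab_table and uses the column names shown above; `marker_copy_counts` is a hypothetical helper and not part of CheckM, and the example table keeps only a subset of the columns from the result above.

```python
import csv
import io


def marker_copy_counts(tab_text):
    """Summarise marker copy-number columns from a tab-separated CheckM table.

    Hypothetical helper: assumes the header names "Bin Id", "0", "1", "2",
    "3", "4", "5+" and "Strain heterogeneity" seen in the output above.
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    summary = {}
    for row in reader:
        summary[row["Bin Id"]] = {
            "missing": int(row["0"]),
            "single_copy": int(row["1"]),
            # Markers found two or more times count as multi-copy.
            "multi_copy": sum(int(row[col]) for col in ("2", "3", "4", "5+")),
            "strain_heterogeneity": float(row["Strain heterogeneity"]),
        }
    return summary


# Subset of the columns from the modified-marker-set result in this thread.
header = ["Bin Id", "Marker lineage", "# markers", "Strain heterogeneity",
          "0", "1", "2", "3", "4", "5+"]
values = ["GNM3519", "f__Rhodobacteraceae (UID3360)", "579", "33.33",
          "1", "576", "2", "0", "0", "0"]
example_table = "\t".join(header) + "\n" + "\t".join(values) + "\n"

print(marker_copy_counts(example_table))
```

Reading the columns by name rather than by position keeps the sketch working for both the short and the extended (-o 2) table layouts, as long as the header names match.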

The modified .ms file is attached. Is this the right way to remove marker genes from this file?

Besides, my other major concern is the high strain heterogeneity (33%), since I'm working on a population genomic analysis at the strain level. Should I resequence this isolate?

Best, Xiaoyuan

modified.ms.txt

donovan-h-parks commented 4 years ago

Hi. Interesting! It seems that the strain heterogeneity measure does not respect the removal of the 3 marker genes you indicated. I will fix this in a future release. The 33% strain heterogeneity (which should be 50%) just indicates that, of the 2 pairs of duplicated markers, 1 pair meets the AAI threshold used to establish genes as likely being from the same strain. It is interesting that 1 pair doesn't meet this criterion: a gene that is typically single copy is present twice in your genome, and the two copies apparently aren't that similar to each other (i.e. not a recent gene duplication, assuming it isn't simply actual contamination). It might be worth checking whether these two copies are at the ends of contigs, which would suggest that this is just an assembly error.
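The arithmetic behind the 33% versus 50% figures can be sketched as follows. This is an illustration of the description above, not CheckM's actual code: `strain_heterogeneity` is a hypothetical function, the marker names and AAI values are invented, and the 0.9 default threshold is an assumption. The measure is taken to be the percentage of duplicated-copy pairs whose amino acid identity (AAI) meets the threshold.

```python
def strain_heterogeneity(pair_aai_by_marker, aai_threshold=0.9):
    """Percentage of multi-copy marker pairs whose AAI meets the threshold.

    pair_aai_by_marker maps a marker ID to the AAI values of each pair of
    copies found for that marker (one value for a marker present twice).
    Hypothetical sketch; the 0.9 default is an assumed threshold.
    """
    total_pairs = 0
    similar_pairs = 0
    for pair_aais in pair_aai_by_marker.values():
        for aai in pair_aais:
            total_pairs += 1
            if aai >= aai_threshold:
                similar_pairs += 1
    if total_pairs == 0:
        return 0.0
    return 100.0 * similar_pairs / total_pairs


# Original marker set: 3 duplicated markers, only 1 pair is similar enough.
# Marker names and AAI values are invented for illustration.
original = {"marker_A": [0.97], "marker_B": [0.55], "marker_C": [0.40]}

# After removing one duplicated marker from the set, 2 pairs remain.
modified = {"marker_A": [0.97], "marker_B": [0.55]}

print(round(strain_heterogeneity(original), 2))
print(round(strain_heterogeneity(modified), 2))
```

Under these assumptions, 1 similar pair out of 3 gives 33.33% and 1 out of 2 gives 50%, which matches the reported value staying at 33.33% when the removed marker is still counted.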

Xiaojun928 commented 4 years ago

Hi Donovan, thanks for your reply. I'll try to check the heterogeneity using other software. Thanks again for your help! : )

Ecogenomics / CheckM

Why two contaminated mark genes lead to 33% heterogeneity #263