Closed Xiaojun928 closed 3 years ago
Hi. Are the duplicated marker genes only duplicated once, or is there a duplicate gene that is present 3 or more times? If you post the entire CheckM output line for this genome, I can better answer your question. CheckM identifies genes that are typically single copy. As you indicated, it is possible that in some instances a gene has been legitimately duplicated.
Hi,
Many thanks for your reply. I found that two genes are duplicated, each only once. The CheckM output is posted here.
Thanks!
Hi. I am unclear why the strain heterogeneity is 33%. If you can send me the genome, I'll debug the code to see if there is an error in the mix.
It's very kind of you to help! Please see the attached genome and marker genes (the list was modified based on a closed complete genome that is very closely related to GNM3519). I am using CheckM v1.1.2.
Thanks and regards
Hi,
I ran your genome through CheckM v1.1.2 with `checkm lineage_wf . ./checkm -x fasta`. The results indicate 3 duplicated genes:
Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GNM3519 | f__Rhodobacteraceae (UID3360) | 56 | 582 | 313 | 3 | 576 | 3 | 0 | 0 | 0 | 99.25 | 0.68 | 33.33 |
What version of CheckM are you using, and what command are you running? Our results are extremely similar, but it would be nice to know why there is a discrepancy. In particular, both results indicate the same marker set (f__Rhodobacteraceae; UID3360), but mine show 582 markers in this set while yours suggest only 579.
Cheers, Donovan
Hi Donovan,
I'm sorry I did not make this clear in my last message. I ran CheckM v1.1.2 with the following command:
checkm lineage_wf -x fasta input_dir output_dir
and it returned the same result as yours:
Bin Id | Marker lineage | # genomes | # markers | # marker sets | Completeness | Contamination | Strain heterogeneity | Genome size (bp) | # ambiguous bases | # scaffolds | # contigs | N50 (scaffolds) | N50 (contigs) | Mean scaffold length (bp) | Mean contig length (bp) | Longest scaffold (bp) | Longest contig (bp) | GC | GC std (scaffolds > 1kbp) | Coding density | Translation table | # predicted genes | 0 | 1 | 2 | 3 | 4 | 5+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GNM3519 | f__Rhodobacteraceae (UID3360) | 56 | 582 | 313 | 99.25 | 0.68 | 33.33 | 4472332 | 0 | 30 | 30 | 487418 | 487418 | 149077 | 149077 | 804436 | 804436 | 56.6 | 3.23 | 91.03 | 11 | 4399 | 3 | 576 | 3 | 0 | 0 | 0 |
I then removed three marker genes from the estimation because they are not present as single-copy genes in a closely related complete genome:
PF13603.1, PF02616.9, PF08529.6
Using the modified .ms file:
checkm qa --tab_table -f result.txt -o 2 modified.ms output_dir
the result is:
Bin Id | Marker lineage | # genomes | # markers | # marker sets | Completeness | Contamination | Strain heterogeneity | Genome size (bp) | # ambiguous bases | # scaffolds | # contigs | N50 (scaffolds) | N50 (contigs) | Mean scaffold length (bp) | Mean contig length (bp) | Longest scaffold (bp) | Longest contig (bp) | GC | GC std (scaffolds > 1kbp) | Coding density | Translation table | # predicted genes | 0 | 1 | 2 | 3 | 4 | 5+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GNM3519 | f__Rhodobacteraceae (UID3360) | 56 | 579 | 312 | 99.68 | 0.64 | 33.33 | 4472332 | 0 | 30 | 30 | 487418 | 487418 | 149077 | 149077 | 804436 | 804436 | 56.6 | 3.23 | 91.03 | 11 | 4399 | 1 | 576 | 2 | 0 | 0 | 0 |
The modified .ms file is attached. Is this the right way to remove marker genes from this file?
Besides, another major concern is the high strain heterogeneity (33%), since I'm working on a population genomic analysis at the strain level. Should I resequence this isolate?
Best, Xiaoyuan
Attachment: modified.ms.txt
Hi. Interesting! It seems that the strain heterogeneity measure does not respect the removal of the 3 marker genes you indicated. I will fix this in a future release. The 33% (which should be 50%) strain heterogeneity just indicates that, of the 2 pairs of duplicated markers, 1 pair meets the AAI threshold used to establish genes as likely being from the same strain. It is interesting that the other pair doesn't meet this criterion: a gene that is typically single copy is present twice in your genome, and the two copies apparently aren't that similar to each other (i.e., not a recent gene duplication, assuming it isn't simply contamination). It might be worth checking whether these two copies sit at the ends of contigs, which would suggest this is just an assembly error.
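The arithmetic behind the 33% and 50% figures can be sketched in a few lines of Python. This is a simplified model, not CheckM's actual code: the function name `strain_heterogeneity`, the AAI values, and the 0.9 threshold are assumptions made for illustration. It treats strain heterogeneity as the percentage of duplicated-marker pairs whose amino acid identity (AAI) meets the threshold, and it assumes the removed marker accounts for one of the dissimilar pairs.

```python
def strain_heterogeneity(pair_aai, aai_threshold=0.9):
    """Percent of duplicated-marker pairs whose AAI meets the threshold.

    pair_aai: one AAI value (0..1) per pair of copies of a duplicated marker.
    aai_threshold: assumed cutoff for "likely from the same strain".
    """
    if not pair_aai:
        return 0.0
    same_strain = sum(1 for aai in pair_aai if aai >= aai_threshold)
    return 100.0 * same_strain / len(pair_aai)


# Hypothetical AAI values: one similar pair, the rest dissimilar.
unmodified_set = [0.98, 0.55, 0.40]  # 3 duplicated markers (582-marker set)
modified_set = [0.98, 0.55]          # 2 duplicated markers (579-marker set)

print(round(strain_heterogeneity(unmodified_set), 2))  # 33.33
print(round(strain_heterogeneity(modified_set), 2))    # 50.0
```

Under these assumptions, 1 similar pair out of 3 gives 33.33%, which matches the reported value, while 1 out of 2 gives the 50% expected once the modified marker set is respected.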
Hi Donovan, Thanks for your reply. I'll try to check the heterogeneity using other software. Thanks again for your help! : )
Hi, I'm using CheckM to estimate the genomic features of 20 isolates. However, one genome exhibits 0.64% contamination (2 duplicated marker genes) and 33% strain heterogeneity. I wonder why the estimated heterogeneity is not 25% or 50%. Besides, considering that I'm working on a population genomic analysis at the strain level, is this genome suitable for downstream analysis? As far as I can tell, such an observation may result from a recent gene duplication, a contaminated bacterial colony, or other issues during heterogeneity estimation. Is it common to find such heterogeneity in a pure bacterial culture?
Thanks and best wishes, Xiaoyuan