ssu_finder find duplicate/triplicate ssu

Ecogenomics / CheckM

Assess the quality of microbial genomes recovered from isolates, single cells, and metagenomes

https://ecogenomics.github.io/CheckM/

GNU General Public License v3.0


ssu_finder find duplicate/triplicate ssu #188

Closed chloelulu closed 5 years ago

chloelulu commented 5 years ago

Hi, developer,

I am confused about ssu_finder in CheckM. I used RefineM to clean my genome bins by filtering on divergent genomic properties, removing contamination based on taxonomic assignments, and removing incongruent 16S sequences. I then wanted to extract the SSU reads present in the genome bins, so I ran ssu_finder in CheckM to find the SSU hits. ssu_finder gave me an ssu_summary table plus bacteria, archaea, and euk tables. When I checked the bacteria and archaea tables, many hits were duplicated: some have the same length, some do not, and some are similar. I merged the ssu.bacteria.txt and ssu.archaea.txt tables into one table, as described below. Some bins have duplicate or triplicate SSU hits. As I understand it, if I use RefineM to filter incongruent SSU sequences, there should be only one or zero SSU per bin. Is something wrong, or am I misunderstanding? Thanks in advance.

donovan-h-parks commented 5 years ago

Not sure I follow. CheckM ssu_finder will identify SSU genes in your MAGs. It does this by running bacterial, archaeal, and euk-specific SSU HMMs. Since these 3 models are similar, it is not unusual to have a valid hit to all 3 models. RefineM will attempt to identify incongruent SSU genes in a MAG. RefineM does not try to reduce a genome down to a single SSU gene. It is not unusual for prokaryotic genomes to have multiple SSU genes.

chloelulu commented 5 years ago

@dparks1134, I appreciate your quick response! Oh, I see, these are genes, not sequences. May I ask a follow-up question? My current goal is to check the long 16S rRNA reads covered by the genome bins (not only the genes), but so far I have only obtained multiple 16S genes in one genome. Would the method be similar to the one you described in your paper 'Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life'? ("These genes were aligned with ssu-align v.0.1 and trailing or leading columns represented by ≤70% of taxa trimmed, which resulted in bacterial and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred with FastTree v.2.1.7 under the GTR+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.") Sorry for so many questions and for the confusion. Thanks so much!

donovan-h-parks commented 5 years ago

The method described in the UBA manuscript is a fairly standard workflow for creating an SSU tree in my opinion. There are lots of ways to build an SSU tree though. Many people insert their sequences into the SILVA SSU tree using ARB. Others prefer the SILVA SINA aligner followed by de novo tree inference.
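The trimming step quoted above from the UBA manuscript (dropping leading or trailing alignment columns represented by ≤70% of taxa) can be sketched in Python. This is only an illustration of the idea, not the script used for the manuscript: the function name, the gap characters, and the toy alignment are all assumptions made for the example.

```python
def trim_alignment_ends(alignment, min_fraction=0.7, gap_chars="-."):
    """Trim leading/trailing columns where <= min_fraction of sequences have a residue.

    `alignment` maps a sequence name to its aligned sequence (all equal length).
    Interior columns are left untouched, even if they are mostly gaps.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def well_covered(col):
        # Fraction of sequences with a non-gap character in this column.
        filled = sum(1 for s in seqs if s[col] not in gap_chars)
        return filled / len(seqs) > min_fraction

    keep = [c for c in range(length) if well_covered(c)]
    if not keep:
        return {name: "" for name in alignment}
    first, last = keep[0], keep[-1]
    return {name: s[first:last + 1] for name, s in alignment.items()}


# Hypothetical toy alignment of four SSU fragments.
toy = {
    "bin_1": "--ACGTAC--",
    "bin_2": "-GACG-ACG-",
    "bin_3": "TGACGTACGT",
    "bin_4": "--ACGTACG-",
}
trimmed = trim_alignment_ends(toy)
for name, seq in trimmed.items():
    print(name, seq)
```

In this toy case the two leading columns and the final column are removed because at most half of the sequences cover them, while the gap inside bin_2 is kept because it is an interior column.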

chloelulu commented 5 years ago

Great! @dparks1134 Thanks! I will have a try.

chloelulu commented 5 years ago

Hi @dparks1134, sorry for yet another follow-up question. While trying to insert the 16S rRNA genes into the tree, and after thinking about this for a while, I still do not understand what you meant by:

Since these 3 models are similar, it is not unusual to have a valid hit to all 3 models.

Take the figure I included in the question as an example. For the first contig with hits, the first 2 valid hits from ssu_finder overlap. I also ran these hits through the RDP classifier, and both were assigned the same taxonomy, which is bacteria. Do I need to include both hit fragments when inferring the tree? I really appreciate your help! Tons of thanks! Sincerely.

donovan-h-parks commented 5 years ago

No. Both models are identifying the same 16S rRNA sequence. They simply disagree exactly where this gene starts and ends. I believe CheckM produces another file indicating which of these 3 models produces the highest bitscore. I usually take this as the best identification of the 16S sequence.
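The selection described here can be sketched in Python. This is a minimal illustration and not CheckM's implementation: the `Hit` record, its field names, and the example coordinates and bitscores are hypothetical, and the sketch assumes that hits from different models refer to the same gene whenever they overlap on the same contig.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    contig: str
    start: int       # 1-based, inclusive
    end: int
    model: str       # e.g. "bacteria", "archaea", "euk"
    bitscore: float


def best_hit_per_locus(hits):
    """Group overlapping hits on the same contig and keep the top bitscore per group."""
    best = []
    ordered = sorted(hits, key=lambda h: (h.contig, h.start, h.end))
    for _, group in groupby(ordered, key=lambda h: h.contig):
        locus, locus_end = [], None
        for hit in group:
            # A hit starting past the current locus opens a new locus.
            if locus and hit.start > locus_end:
                best.append(max(locus, key=lambda h: h.bitscore))
                locus, locus_end = [], None
            locus.append(hit)
            locus_end = hit.end if locus_end is None else max(locus_end, hit.end)
        if locus:
            best.append(max(locus, key=lambda h: h.bitscore))
    return best


# Hypothetical hits: three models hit one gene on contig_1, one model hits contig_7.
hits = [
    Hit("contig_1", 100, 1600, "bacteria", 1500.0),
    Hit("contig_1", 120, 1580, "archaea", 900.0),
    Hit("contig_1", 110, 1590, "euk", 400.0),
    Hit("contig_7", 5000, 6500, "bacteria", 1450.0),
]
selected = best_hit_per_locus(hits)
for hit in selected:
    print(hit.contig, hit.start, hit.end, hit.model, hit.bitscore)
```

With these example values, the three overlapping hits on contig_1 collapse to the single bacterial hit, while the hit on contig_7 remains as a second locus, which is consistent with a genome carrying more than one SSU gene.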

chloelulu commented 5 years ago

Thank you so much @dparks1134 Better understanding now. Really thankful!

473021677 commented 4 years ago

Hello, I used ssu_finder in CheckM to identify 16S rRNA genes in one genome, but I got two 16S rRNA genes for each of the bacterial, archaeal, and euk-specific SSU HMMs. For the archaeal SSU HMM, should I select the hit with the higher bitscore as the best identification of the 16S sequence? Thanks very much.

donovan-h-parks commented 4 years ago

Hi. The ssu_finder method runs HMMs for bacteria, archaea, and euks. Given the sequence similarity between SSU/LSU sequences from these domains, it is expected that all three HMMs will return hits. I believe the *.fna file selects the model with the highest bitscore. The raw HMM results are only provided for transparency and are not intended to be processed manually by users.

473021677 commented 4 years ago

Thanks for your help. I have the ssu.fna file, but it includes two sequences. I don't know how to choose one of them as the best identification of the 16S rRNA gene. Should I select the one with the higher bitscore?
