fplazaonate closed this issue 2 years ago
Hi,
Our clustering procedure relies exclusively on a genome-wide ANI comparison with a 95% ANI and 30% aligned fraction threshold. GTDB does things differently and in certain cases defines species-specific distance cutoffs. There are pros and cons to each approach, but our priority was a consistent, standardized dereplication method that is comparable across species. An additional issue is that, when dealing with hundreds of thousands of MAGs, running GTDB before dereplication is not realistic because of its computational overhead.
Given the differences between the approaches, and the fact that no threshold is perfect, there will be cases where distinct GTDB species are merged into one dRep cluster or a single GTDB species is split across several. In those cases we have to leave it to users to decide how to handle the outliers in their own studies.
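To make the difference concrete, here is a minimal sketch of how a fixed cutoff and a species-specific cutoff can disagree on the same pair of genomes. The function name `same_cluster`, the pair's ANI and aligned-fraction values, and the 96% species-specific cutoff are all hypothetical and chosen only for illustration. It is a pairwise rule only, not the actual dRep or GTDB implementation.

```python
def same_cluster(ani, aligned_fraction, min_ani=95.0, min_af=30.0):
    """Return True if a genome pair passes both thresholds (values in percent).

    Assumption: the thresholds are inclusive (>=). The defaults mirror the
    95% ANI / 30% aligned fraction criteria mentioned above.
    """
    return ani >= min_ani and aligned_fraction >= min_af


# Hypothetical genome pair: 95.5% ANI over 60% of the genome.
pair_ani, pair_af = 95.5, 60.0

# Fixed cutoff: the pair is grouped together.
fixed = same_cluster(pair_ani, pair_af)

# Hypothetical species-specific cutoff of 96% ANI: the pair is kept apart.
species_specific = same_cluster(pair_ani, pair_af, min_ani=96.0)

print(fixed, species_specific)
```

A pair like this one would end up merged under the fixed threshold while still counting as two species under the stricter cutoff.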
Best, Alex
Thanks for the reply. I understand the rationale, but I suggested this strategy because, if I understand correctly, it is what you did for the Mouse Gastrointestinal Bacterial Catalogue.
Anyway, I have noticed that some MAGs have an ANI with their conspecific representative well below 95%. Here is an example for Adlercreutzia celatus_A (Adlercreutzia equolifaciens):
species_representative | bin | ANI |
---|---|---|
GUT_GENOME092042.fasta | GUT_GENOME083854.fasta | 94.028 |
GUT_GENOME092042.fasta | GUT_GENOME247352.fasta | 94.0263 |
GUT_GENOME092042.fasta | GUT_GENOME093985.fasta | 94.0187 |
GUT_GENOME092042.fasta | GUT_GENOME089001.fasta | 93.9522 |
GUT_GENOME092042.fasta | GUT_GENOME082530.fasta | 93.9514 |
GUT_GENOME092042.fasta | GUT_GENOME236938.fasta | 93.9294 |
GUT_GENOME092042.fasta | GUT_GENOME094299.fasta | 93.9258 |
GUT_GENOME092042.fasta | GUT_GENOME262697.fasta | 93.9184 |
GUT_GENOME092042.fasta | GUT_GENOME061216.fasta | 93.905 |
GUT_GENOME092042.fasta | GUT_GENOME081124.fasta | 93.8848 |
GUT_GENOME092042.fasta | GUT_GENOME021153.fasta | 93.8719 |
GUT_GENOME092042.fasta | GUT_GENOME248870.fasta | 93.8593 |
GUT_GENOME092042.fasta | GUT_GENOME075690.fasta | 93.8101 |
GUT_GENOME092042.fasta | GUT_GENOME155712.fasta | 93.6612 |
GUT_GENOME092042.fasta | GUT_GENOME082668.fasta | 93.6024 |
GUT_GENOME092042.fasta | GUT_GENOME085122.fasta | 93.3607 |
GUT_GENOME092042.fasta | GUT_GENOME209095.fasta | 92.9404 |
GUT_GENOME092042.fasta | GUT_GENOME248593.fasta | 92.9212 |
GUT_GENOME092042.fasta | GUT_GENOME286935.fasta | 92.663 |
Hello again,
Indeed for the MGBC we grouped GTDB clusters first. However, in that case we were "only" dealing with 26k MAGs. The human gut catalog has now reached 300k genomes, so it is just not possible to run GTDB first on these (especially since we need to rerun with newer database versions that come out whenever we update the catalogs).
For the clustering, it is also normal for there to be pairwise ANI values below the threshold used for clustering. This depends on the clustering procedure used. You can have cases where genome A shares 96% ANI with genome B and genome B shares 97% ANI with genome C, but genome C shares only 93% ANI with genome A. Whether genomes A/B/C end up grouped together depends on how greedy or conservative the clustering procedure is (complete, average or single linkage).
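The A/B/C example can be worked through with a small, self-contained sketch of agglomerative clustering on ANI values. This is a toy illustration, not the dRep code. The function names and the 95% cutoff applied to cluster-level similarity are assumptions made for the example.

```python
from itertools import combinations

# Pairwise ANI (%) from the example: A-B 96, B-C 97, A-C 93.
ANI = {
    frozenset("AB"): 96.0,
    frozenset("BC"): 97.0,
    frozenset("AC"): 93.0,
}


def cluster_similarity(c1, c2, linkage):
    """Similarity between two clusters under the chosen linkage."""
    values = [ANI[frozenset((a, b))] for a in c1 for b in c2]
    if linkage == "single":    # most similar pair decides (greedy)
        return max(values)
    if linkage == "complete":  # least similar pair decides (conservative)
        return min(values)
    if linkage == "average":   # mean over all cross-cluster pairs
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {linkage}")


def cluster(genomes, threshold, linkage):
    """Merge the most similar clusters until none reach the threshold."""
    clusters = [frozenset([g]) for g in genomes]
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], linkage),
        )
        if cluster_similarity(c1, c2, linkage) < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


for linkage in ("single", "average", "complete"):
    print(linkage, cluster("ABC", 95.0, linkage))
# single   [['A', 'B', 'C']]
# average  [['A'], ['B', 'C']]
# complete [['A'], ['B', 'C']]
```

With single linkage, A joins the B/C cluster through its 96% ANI to B, even though its ANI to C is only 93%. With average linkage the A-to-cluster similarity is (96 + 93) / 2 = 94.5%, and with complete linkage it is 93%, so A stays separate in both cases.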
Alex
Thanks for the explanation. I will perform dereplication with my own criteria then.
Best regards, Florian
Hi,
I have noticed that MGnify merges some distinct species in the same cluster. Some examples are:
In these cases, the species delineation ANI cutoff is slightly above 95%.
I would suggest performing dereplication in two steps:
What do you think?