MGnify merges distinct species in the same cluster

fplazaonate commented 2 years ago

Hi,

I have noticed that MGnify merges some distinct species in the same cluster. Some examples are:

Phocaeicola dorei and Phocaeicola vulgatus
Adlercreutzia celatus_A and Adlercreutzia equolifaciens

In these cases, the species delineation ANI cutoff is slightly above 95%.

I would suggest performing dereplication in two steps:

1. Group genomes with the same GTDB species-level annotation, if available
2. Use dRep for genomes without a species-level annotation
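
The two-step idea could be sketched roughly as below. This is only an illustration under assumptions: the function `two_step_groups`, the genome IDs, and the input mapping from genome ID to GTDB species label are all hypothetical, and the actual dRep run on the leftover genomes is not shown.

```python
from collections import defaultdict


def two_step_groups(gtdb_species):
    """Split genomes into GTDB species groups and a leftover list.

    gtdb_species maps a genome ID to its GTDB species label, e.g.
    "s__Phocaeicola dorei". An empty label or a bare "s__" is treated
    as "no species-level annotation" (an assumption of this sketch).
    """
    groups = defaultdict(list)
    unannotated = []
    for genome, species in gtdb_species.items():
        label = (species or "").strip()
        if label in ("", "s__"):
            unannotated.append(genome)  # step 2: these would go to dRep
        else:
            groups[label].append(genome)  # step 1: group by GTDB species
    return dict(groups), unannotated


# Hypothetical genome IDs, for illustration only.
example = {
    "genome_1": "s__Phocaeicola dorei",
    "genome_2": "s__Phocaeicola vulgatus",
    "genome_3": "s__Phocaeicola dorei",
    "genome_4": "s__",
}
groups, unannotated = two_step_groups(example)
print(groups)
print(unannotated)
```

Here `genome_1` and `genome_3` end up in one group, `genome_2` in another, and `genome_4` is left over for ANI-based clustering.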

What do you think?

alexmsalmeida commented 2 years ago

Hi,

Our clustering procedure relies exclusively on a genome-wide ANI comparison with a 95% ANI and 30% aligned fraction threshold. GTDB does things differently and in certain cases defines species-specific distance cutoffs. There are pros and cons to each approach, but our priority was having a consistent and standardized dereplication method that would be comparable across species. An additional issue is that when dealing with hundreds of thousands of MAGs, running GTDB before dereplication is just not a realistic solution because of its computational overhead.

Given the differences in approach and the fact that no threshold is perfect, there are going to be cases where distinct GTDB species are either merged into one dRep cluster or split across several. In those cases we just need to leave it up to the users to decide how they want to deal with these outliers in their own studies.

Best, Alex

fplazaonate commented 2 years ago

Thanks for the reply. I understand the rationale, but I suggested this strategy because, if I understand correctly, it is what you did for the Mouse Gastrointestinal Bacterial Catalogue.

Anyway, I have noticed that some MAGs have an ANI with their conspecific representative well below 95%. Here is an example for Adlercreutzia celatus_A (Adlercreutzia equolifaciens):

species_representative	bin	ANI
GUT_GENOME092042.fasta	GUT_GENOME083854.fasta	94.028
GUT_GENOME092042.fasta	GUT_GENOME247352.fasta	94.0263
GUT_GENOME092042.fasta	GUT_GENOME093985.fasta	94.0187
GUT_GENOME092042.fasta	GUT_GENOME089001.fasta	93.9522
GUT_GENOME092042.fasta	GUT_GENOME082530.fasta	93.9514
GUT_GENOME092042.fasta	GUT_GENOME236938.fasta	93.9294
GUT_GENOME092042.fasta	GUT_GENOME094299.fasta	93.9258
GUT_GENOME092042.fasta	GUT_GENOME262697.fasta	93.9184
GUT_GENOME092042.fasta	GUT_GENOME061216.fasta	93.905
GUT_GENOME092042.fasta	GUT_GENOME081124.fasta	93.8848
GUT_GENOME092042.fasta	GUT_GENOME021153.fasta	93.8719
GUT_GENOME092042.fasta	GUT_GENOME248870.fasta	93.8593
GUT_GENOME092042.fasta	GUT_GENOME075690.fasta	93.8101
GUT_GENOME092042.fasta	GUT_GENOME155712.fasta	93.6612
GUT_GENOME092042.fasta	GUT_GENOME082668.fasta	93.6024
GUT_GENOME092042.fasta	GUT_GENOME085122.fasta	93.3607
GUT_GENOME092042.fasta	GUT_GENOME209095.fasta	92.9404
GUT_GENOME092042.fasta	GUT_GENOME248593.fasta	92.9212
GUT_GENOME092042.fasta	GUT_GENOME286935.fasta	92.663
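
For anyone who wants to run the same check on their own table, a minimal sketch follows. It assumes a tab-separated table with the three columns shown above; the function name `below_threshold` is made up for this example, and only the first three rows of the table are embedded.

```python
import csv
import io

# First three rows of the table above, tab-separated.
table = (
    "species_representative\tbin\tANI\n"
    "GUT_GENOME092042.fasta\tGUT_GENOME083854.fasta\t94.028\n"
    "GUT_GENOME092042.fasta\tGUT_GENOME247352.fasta\t94.0263\n"
    "GUT_GENOME092042.fasta\tGUT_GENOME093985.fasta\t94.0187\n"
)


def below_threshold(tsv_text, threshold=95.0):
    """Return (bin, ANI) pairs whose ANI to the representative is below threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        ani = float(row["ANI"])
        if ani < threshold:
            hits.append((row["bin"], ani))
    return hits


flagged = below_threshold(table)
for bin_name, ani in flagged:
    print(bin_name, ani)
```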

alexmsalmeida commented 2 years ago

Hello again,

Indeed for the MGBC we grouped GTDB clusters first. However, in that case we were "only" dealing with 26k MAGs. The human gut catalog has now reached 300k genomes, so it is just not possible to run GTDB first on these (especially since we need to rerun with newer database versions that come out whenever we update the catalogs).

For the clustering, it is also normal for there to be pairwise ANI values below the threshold used for clustering. This depends on the clustering procedure used. You can have cases where genome A shares 96% ANI with genome B and genome B shares 97% ANI with genome C, but genome C shares only 93% ANI with genome A. Whether genomes A/B/C end up grouped together depends on how greedy or conservative the clustering procedure is (single, average or complete linkage).
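
To make the A/B/C example concrete, here is a small self-contained sketch of threshold-based agglomerative clustering under the three linkage rules. It is a toy illustration, not the pipeline's or dRep's code: the `cluster` function and its arguments are made up for this example, and only the three ANI values quoted above are used.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# On ANI (a similarity), single linkage takes the best pair between two
# clusters, complete linkage the worst pair, average linkage the mean.
LINKAGE = {"single": max, "average": mean, "complete": min}


def cluster(ani, threshold, method):
    """Merge clusters greedily while their linkage ANI is >= threshold.

    ani maps frozenset({x, y}) to the pairwise ANI (%) of genomes x and y.
    """
    score = LINKAGE[method]

    def link_ani(a, b):
        return score(ani[frozenset((x, y))] for x in a for y in b)

    genomes = sorted({g for pair in ani for g in pair})
    clusters = [frozenset([g]) for g in genomes]
    while len(clusters) > 1:
        # Pick the two clusters that are most similar under this linkage.
        a, b = max(combinations(clusters, 2), key=lambda p: link_ani(*p))
        if link_ani(a, b) < threshold:
            break  # nothing left that passes the cutoff
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


ani = {
    frozenset("AB"): 96.0,
    frozenset("BC"): 97.0,
    frozenset("AC"): 93.0,
}
for method in ("single", "average", "complete"):
    print(method, cluster(ani, 95.0, method))
```

With a 95% cutoff, single linkage puts A, B and C in one cluster even though A and C share only 93% ANI, because A joins via its 96% match to B. Average linkage (mean of 96 and 93 is 94.5) and complete linkage (minimum is 93) both keep A apart from the B/C cluster.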

Alex

fplazaonate commented 2 years ago

Thanks for the explanation. I will perform dereplication with my own criteria then.

Best regards, Florian

EBI-Metagenomics / genomes-catalogue-pipeline

MGnify merges distinct species in the same cluster #14