EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

MGnify merges distinct species in the same cluster #14

Closed fplazaonate closed 2 years ago

fplazaonate commented 2 years ago

Hi,

I have noticed that MGnify merges some distinct species in the same cluster. Some examples are:

In these cases, the species delineation ANI cutoff is slightly above 95%.

I would suggest performing dereplication in two steps:

  1. Group genomes with the same GTDB annotation at species level if available
  2. Use dRep for genomes without annotation at species level
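
The two steps above could be sketched roughly as follows. This is only an illustration, not the pipeline's code: the function name `split_by_gtdb_species`, the genome IDs and the shortened lineage strings are all hypothetical. It assumes each genome comes with a GTDB-style lineage string whose species rank is prefixed with `s__` and left empty when no species was assigned.

```python
from collections import defaultdict


def split_by_gtdb_species(taxonomy):
    """Step 1: group genomes sharing a GTDB species label.

    `taxonomy` maps genome ID -> GTDB-style lineage string (or None).
    Returns (groups, unannotated); the unannotated genomes would be
    handed to dRep in step 2.
    """
    groups = defaultdict(list)
    unannotated = []
    for genome, lineage in taxonomy.items():
        species = ""
        for rank in (lineage or "").split(";"):
            rank = rank.strip()
            if rank.startswith("s__"):
                species = rank[3:].strip()
        if species:
            groups[species].append(genome)
        else:
            unannotated.append(genome)
    return dict(groups), unannotated


# Hypothetical input; lineages are shortened for readability.
taxonomy = {
    "genome_1": "d__Bacteria;g__Adlercreutzia;s__Adlercreutzia equolifaciens",
    "genome_2": "d__Bacteria;g__Adlercreutzia;s__Adlercreutzia equolifaciens",
    "genome_3": "d__Bacteria;g__Adlercreutzia;s__",
    "genome_4": None,
}
groups, unannotated = split_by_gtdb_species(taxonomy)
```

Here `groups` holds one list per annotated species, and `unannotated` holds the genomes that would still need ANI-based clustering.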

What do you think?

alexmsalmeida commented 2 years ago

Hi,

Our clustering procedure relies exclusively on a genome-wide ANI comparison, with thresholds of 95% ANI and 30% aligned fraction. GTDB does things differently and in certain cases defines species-specific distance cutoffs. There are pros and cons to each approach, but our priority was having a consistent and standardized dereplication method that would be comparable across species. An additional issue is that when dealing with hundreds of thousands of MAGs, running GTDB first before dereplication is just not a realistic solution because of its computational overhead.
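
As a minimal sketch of what a fixed, species-independent criterion looks like for a single genome pair: the helper name `passes_thresholds`, its parameters and the example values are made up for illustration, and the units (ANI in percent, aligned fraction as a 0 to 1 proportion) are an assumption.

```python
def passes_thresholds(ani, aligned_fraction, min_ani=95.0, min_af=0.30):
    """Return True if a genome pair meets both fixed cutoffs.

    ani is in percent; aligned_fraction is a proportion between 0 and 1.
    The same cutoffs apply to every pair, whatever the species.
    """
    return ani >= min_ani and aligned_fraction >= min_af


# Made-up values, not taken from the catalogue.
high_ani_good_coverage = passes_thresholds(96.2, 0.45)
high_ani_low_coverage = passes_thresholds(96.2, 0.20)
low_ani_good_coverage = passes_thresholds(94.0, 0.80)
```

Only the first pair satisfies both cutoffs. A species-specific scheme such as GTDB's would instead vary `min_ani` from one species to another.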

Given the differences in approach and the fact that no threshold is perfect, there are going to be cases where distinct GTDB species are either merged or split into different dRep clusters. In these cases we just need to leave it up to the users to decide how they want to deal with these outliers in their own studies.

Best, Alex

fplazaonate commented 2 years ago

Thanks for the reply. I understand the rationale, but I was suggesting this strategy because it is something you did for the Mouse Gastrointestinal Bacterial Catalogue, if I understand correctly.

Anyway, I have noticed that some MAGs have an ANI with their conspecific representative well below 95%. Here is an example for Adlercreutzia celatus_A (Adlercreutzia equolifaciens):

species_representative bin ANI
GUT_GENOME092042.fasta GUT_GENOME083854.fasta 94.028
GUT_GENOME092042.fasta GUT_GENOME247352.fasta 94.0263
GUT_GENOME092042.fasta GUT_GENOME093985.fasta 94.0187
GUT_GENOME092042.fasta GUT_GENOME089001.fasta 93.9522
GUT_GENOME092042.fasta GUT_GENOME082530.fasta 93.9514
GUT_GENOME092042.fasta GUT_GENOME236938.fasta 93.9294
GUT_GENOME092042.fasta GUT_GENOME094299.fasta 93.9258
GUT_GENOME092042.fasta GUT_GENOME262697.fasta 93.9184
GUT_GENOME092042.fasta GUT_GENOME061216.fasta 93.905
GUT_GENOME092042.fasta GUT_GENOME081124.fasta 93.8848
GUT_GENOME092042.fasta GUT_GENOME021153.fasta 93.8719
GUT_GENOME092042.fasta GUT_GENOME248870.fasta 93.8593
GUT_GENOME092042.fasta GUT_GENOME075690.fasta 93.8101
GUT_GENOME092042.fasta GUT_GENOME155712.fasta 93.6612
GUT_GENOME092042.fasta GUT_GENOME082668.fasta 93.6024
GUT_GENOME092042.fasta GUT_GENOME085122.fasta 93.3607
GUT_GENOME092042.fasta GUT_GENOME209095.fasta 92.9404
GUT_GENOME092042.fasta GUT_GENOME248593.fasta 92.9212
GUT_GENOME092042.fasta GUT_GENOME286935.fasta 92.663

alexmsalmeida commented 2 years ago

Hello again,

Indeed for the MGBC we grouped GTDB clusters first. However, in that case we were "only" dealing with 26k MAGs. The human gut catalog has now reached 300k genomes, so it is just not possible to run GTDB first on these (especially since we need to rerun with newer database versions that come out whenever we update the catalogs).

As for the clustering, it is normal for some pairwise ANI values within a cluster to fall below the clustering threshold. This depends on the clustering procedure used. You can have cases where genome A shares 96% ANI with genome B and genome B shares 97% ANI with genome C, but genome C shares only 93% ANI with genome A. Whether genomes A, B and C end up grouped together depends on how greedy or conservative the clustering procedure is (single, average or complete linkage).
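
The A/B/C example can be worked through with a toy agglomerative clustering. This is a plain-Python sketch for three genomes, not dRep's implementation; the function name `cluster_by_ani` and its arguments are hypothetical. Clusters are merged as long as their linkage ANI stays at or above the threshold.

```python
from itertools import combinations


def cluster_by_ani(ani, genomes, threshold, linkage):
    """Merge clusters while the linkage ANI between the closest pair
    of clusters is >= threshold.

    `ani` maps frozenset({a, b}) -> pairwise ANI in percent.
    `linkage` is "single", "average" or "complete".
    """
    aggregate = {
        "single": max,      # closest pair of members decides
        "complete": min,    # most distant pair of members decides
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [frozenset([g]) for g in genomes]

    def link(c1, c2):
        return aggregate([ani[frozenset((a, b))] for a in c1 for b in c2])

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return sorted(sorted(c) for c in clusters)


# Pairwise ANI values from the example above.
ani = {
    frozenset(("A", "B")): 96.0,
    frozenset(("B", "C")): 97.0,
    frozenset(("A", "C")): 93.0,
}
single = cluster_by_ani(ani, ["A", "B", "C"], 95.0, "single")      # [['A', 'B', 'C']]
average = cluster_by_ani(ani, ["A", "B", "C"], 95.0, "average")    # [['A'], ['B', 'C']]
complete = cluster_by_ani(ani, ["A", "B", "C"], 95.0, "complete")  # [['A'], ['B', 'C']]
```

In this three-genome case only single linkage puts A and C in the same cluster despite their 93% ANI. With average linkage, A against {B, C} averages 94.5%, just under the cutoff. In larger clusters, average linkage can also leave individual pairs below the threshold.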

Alex

fplazaonate commented 2 years ago

Thanks for the explanation. I will perform dereplication with my own criteria then.

Best regards, Florian