GTDB for large number of genomes (1.3M)

intikhab commented 1 week ago

Dear GTDB team,

I have a few questions about processing a large number of MAGs with GTDB-Tk version 2.4.

1) If I already have GTDB R220-based results from skANI, is there a way to establish the associated taxonomic lineage, e.g. by mapping accessions to lineages?
2) A full run of GTDB-Tk on 1.3 million MAGs is stuck even though I am using 3 TB of RAM and 40 CPUs. If I separately calculate the Mash sketches for the query genomes, is there a way to process this data faster?
3) I am also running ani_rep on these MAGs, and it is also slow at the start. If I compute the Mash sketches of the query genomes and the Mash distances myself, is there a script in the GTDB-Tk repository for each of these steps separately, and a way to continue the workflow once the Mash distances are complete?
4) I used the -f (full tree) option in one of the runs, and it appears to be very slow so far.

Any suggestions on the above would be great to move forward.

Many Thanks, Intikhab

donovan-h-parks commented 1 week ago

Hi Intikhab,

1) Yes, but not using GTDB-Tk. You'd need to write your own script to determine which of your MAGs are similar enough to existing GTDB species clusters to be assigned to that species. GTDB species clusters are generally defined at >=95% ANI, but the threshold can be as high as 97% ANI. The following file indicates the ANI threshold for each GTDB species cluster: https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/sp_clusters_r220.tsv

2) I don't see an easy way to improve speed. I would recommend that you run your MAGs in batches of 5,000 or 10,000. This is how we typically run large numbers of MAGs. This lets you better monitor progress and ensures that one "bad" genome that might crash GTDB-Tk doesn't ruin the entire run.

3) The GTDB-Tk workflow is divided into individual steps, but it isn't designed for people to provide their own Mash calculations as an input. This is probably possible, but you'd need to modify the GTDB-Tk code.

4) The -f flag is not recommended if you have a large number of MAGs.
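
The batching suggested in point 2 could be scripted along these lines. This is a minimal sketch, not part of GTDB-Tk: the function names, the `batch_NNNN.tsv` file naming, and deriving the genome ID from the file name are all assumptions for illustration. Each output file uses the two-column, tab-separated layout (FASTA path, genome ID) that the GTDB-Tk `--batchfile` option reads.

```python
from pathlib import Path


def chunk(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_batchfiles(genome_paths, out_dir, batch_size=5000):
    """Write one two-column TSV (FASTA path, genome ID) per batch.

    The genome ID is the file name without its final extension
    (an assumption; adapt this to your own naming scheme and make
    sure the resulting IDs are unique).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batches = chunk(sorted(genome_paths), batch_size)
    for i, batch in enumerate(batches, start=1):
        batch_file = out_dir / f"batch_{i:04d}.tsv"
        with batch_file.open("w") as fh:
            for path in batch:
                fh.write(f"{path}\t{Path(path).stem}\n")
        written.append(batch_file)
    return written
```

Each batch file can then be submitted as its own GTDB-Tk job, so a genome that crashes one job affects only that batch.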

Cheers, Donovan

jianshu93 commented 1 week ago

Hi both, perhaps try gsearch (https://github.com/jean-pierreBoth/gsearch)? It works for any number of database genomes and query genomes.

intikhab commented 1 week ago

Hi Jianshu,

gsearch looks very powerful, but it also provides only ANI values and the closest reference genomes. Finding the closest lineage still requires something like an accession-to-taxon-ID mapping and a taxon-ID-to-lineage mapping, applied at a cutoff of, say, 95% ANI.

I have now dereplicated the 1.3 million MAGs using skder/skani, leading to 786,807 non-redundant MAGs.

Now using ani_rep from GTDB-Tk, there is some progress, as shown below:
[2024-11-05 16:46:40] INFO: Creating Mash sketch file: ...
[2024-11-05 18:11:44] INFO: Completed 786,807 genomes in 85.06 minutes (9,249.88 genomes/minute).
[2024-11-05 18:11:44] INFO: Loading data from existing Mash sketch file: ../gtdb220_mashdb.msh
[2024-11-05 18:11:50] INFO: Calculating Mash distances.

Once the ANI values and closest genome accessions are available, is there a straightforward way to obtain the taxonomic lineage, considering say >=95% ANI?
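
One way to approach this, following Donovan's point 1, is to compare each best hit against the ANI threshold of the matching GTDB species cluster instead of a fixed 95%. The sketch below is not a GTDB-Tk feature. It assumes that `sp_clusters_r220.tsv` is tab-separated with a header row containing the columns `Representative genome`, `GTDB taxonomy`, `GTDB species`, and `ANI circumscription radius` (check the header of your copy), that `GTDB taxonomy` stops at genus, that the closest reference is a species representative whose accession is written the same way as in that file, and that ANI is on a 0-100 scale. The `min_af` alignment-fraction cutoff is a hypothetical parameter, not a value given in this thread.

```python
import csv


def load_species_clusters(handle):
    """Map representative accession -> (full lineage, ANI radius).

    `handle` is an open text handle on the species-cluster TSV.
    Column names are assumptions; adjust them to the actual header.
    """
    clusters = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        lineage = f'{row["GTDB taxonomy"]};{row["GTDB species"]}'
        radius = float(row["ANI circumscription radius"])
        clusters[row["Representative genome"]] = (lineage, radius)
    return clusters


def assign_species(ref_accession, ani, af, clusters, min_af=0.5):
    """Return the species lineage if the hit lies within the cluster's
    ANI radius and passes the AF cutoff; otherwise return None."""
    if ref_accession not in clusters:
        return None
    lineage, radius = clusters[ref_accession]
    if ani >= radius and af >= min_af:
        return lineage
    return None
```

A query that returns `None` here is not assigned to any existing species cluster by this rule; placing it at a higher rank is a separate question.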

I am trying another round of skder/skani dereplication at 95% ANI so that I can obtain monophyletic representative genomes, which I may be able to pass through GTDB-Tk comparatively fast.

Best Wishes, Intikhab

jianshu93 commented 1 week ago

Each genome is attached to a taxonomy; it is there for any database that was built with taxonomy. In any case, the taxonomy was named based on ANI and AAI. So what's the real problem then? Jianshu

intikhab commented 1 week ago

Dear Jianshu,

Thanks for your point, yes, I agree each NCBI genome is attached to a taxonomy.

However, for each query genome you need to decide which taxonomic level is an appropriate assignment. For example, we could assign the strain level if ANI is 100% and the alignment fraction is also 100%. For the species level, >=95% identity is recommended. If the top neighbour reference genome shows <95% ANI, maybe the genus level can be assigned.
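
The decision rule described above can be written down as a small function. This is only a sketch of the heuristic proposed in this comment, not a description of how GTDB or GTDB-Tk assigns ranks. The function names, the treatment of ANI and AF as percentages, and the fallback to genus for every hit below 95% ANI are assumptions taken from the wording above. Lineages are assumed to be GTDB-style strings with rank prefixes (`d__` to `s__`) separated by semicolons.

```python
def truncate_lineage(lineage, rank_prefix):
    """Keep lineage entries up to and including the given rank prefix."""
    kept = []
    for taxon in lineage.split(";"):
        kept.append(taxon)
        if taxon.startswith(rank_prefix):
            break
    return ";".join(kept)


def propose_rank(ani, af):
    """Rank suggested by the heuristic: 'strain', 'species', or 'genus'."""
    if ani == 100.0 and af == 100.0:
        return "strain"
    if ani >= 95.0:
        return "species"
    # Tentative: below 95% ANI the comment only says genus "maybe" applies.
    return "genus"


def propose_lineage(ref_lineage, ani, af):
    """Return (rank, lineage), cutting the reference lineage at genus
    when the hit does not reach the species-level cutoff."""
    rank = propose_rank(ani, af)
    if rank == "genus":
        return rank, truncate_lineage(ref_lineage, "g__")
    return rank, ref_lineage
```

Under this rule, a query that receives only a genus-level lineage is a candidate novel species.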

gsearch and skANI provide the closest neighbour reference genomes, but the next step, assignment to a taxonomic level, is not available. This step helps us to distinguish novel from known taxa, e.g. if we can assign only genus-level taxonomy to a query genome, this perhaps shows that we have found a novel species.

Are you able to add a step in gsearch for taxonomic lineage assignment to identify known vs novel species? It would be a good addition.

Best Wishes, Intikhab

intikhab commented 1 week ago

Hi Donovan,

Regarding GTDB taxonomic lineage assignment at the genus, family, order, class, phylum or kingdom level, do you use ANI values below 85%? E.g. if the closest genome has AF >=50% and ANI below 90%, does GTDB assign family-level taxonomy?

A fast approach for a large number of genomes could be to take skANI/gsearch or GTDB-Tk ani_rep results with ANI and AF and process them further to assign taxonomic lineages to the query genomes.

Does GTDB-Tk have such a feature internally that could be provided as an option?

Intikhab

Ecogenomics / GTDBTk

GTDB for large number of genomes (1.3M) #611