bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
MIT License
(Feature Request) Host the GTDB r214.1 skani database #29

Closed by jolespin 6 months ago

jolespin commented 7 months ago

I'm following along on the tutorial here: https://github.com/bluenote-1577/skani/wiki/Tutorial:-setting-up-the-GTDB-genome-database-to-search-against

I noticed that your other tool Sylph has a few databases built here: https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases

Would it be possible to host the Skani versions of these datasets as well? In particular, the GTDB r214.1.

Regarding your note here:

This workflow is magnitudes faster than GTDB-tk, a classification tool associated to the GTDB. However, GTDB-tk is much more sensitive if your assembled genome has no direct species representative in the database. Furthermore, skani does not put the genome on a tree.

Have you looked into any ANI thresholds that could be used to suggest novel species, genus, family, etc? For example, < 95% might be new species within the genus? Or is this too speculative here.

bluenote-1577 commented 7 months ago

Hi @jolespin,

I could consider it. The issue is that skani's sketched databases are quite big; I forget the size of skani's GTDB-R214 sketch but it's much larger than sylph's. I'll leave this up and if a lot of other people request it, I'll put a copy up.

Have you looked into any ANI thresholds that could be used to suggest novel species, genus, family, etc? For example, < 95% might be new species within the genus? Or is this too speculative here.

skani works above ~80% ANI. I don't have a reference right now, but I believe 80-95% ANI should generally indicate genus-level similarity. So yes, you could use it for genus-level annotation too. However, two genomes from the same genus can have < 80% ANI, so it won't be perfect.
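As a rough illustration of the rule of thumb above, the mapping could be sketched like this. The function name `rough_rank_from_ani` is hypothetical, and the cut-offs (95% for species, ~80% as skani's approximate lower limit) are assumptions taken from this thread, not validated thresholds.

```python
def rough_rank_from_ani(ani_percent: float) -> str:
    """Map an ANI value (in percent) to a rough, heuristic label.

    Assumed cut-offs from the discussion: >= 95% suggests the same species,
    80-95% suggests the same genus, and below ~80% skani is not reliable.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be between 0 and 100")
    if ani_percent >= 95.0:
        return "same species (likely)"
    if ani_percent >= 80.0:
        return "same genus (likely), possibly a new species"
    # Genomes from the same genus can still fall here, so this is not
    # evidence of a different genus on its own.
    return "below skani's reliable range"


for ani in (98.7, 88.2, 76.0):
    print(ani, "->", rough_rank_from_ani(ani))
```

The last branch deliberately avoids naming a rank, since an ANI below ~80% cannot be interpreted with skani either way.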

jolespin commented 7 months ago

I was thinking about setting a low threshold like 50% and then running some tests to see if I could set thresholds for different taxonomic levels, but if skani won't work with ANI < 80%, then I might need to focus only on genus and species level.

motroy commented 6 months ago

Hi @bluenote-1577,

Thank you for this excellent tool, great work! I just set up the GTDB-R214 sketch database, so regarding your comment:

I could consider it. The issue is that skani's sketched databases are quite big; I forget the size of skani's GTDB-R214 sketch but it's much larger than sylph's. I'll leave this up and if a lot of other people request it, I'll put a copy up.

The GTDB-R214 sketch database directory is ~51GB (~23GB as tar.gz or ~19GB as tar.xz). It would be great if the sketch database versions were made available for download.

bluenote-1577 commented 6 months ago

Hi all,

Thanks for the feedback. I didn't realize the compression would be this good. I have now uploaded the R214 sketch as a tar.gz file at https://storage.googleapis.com/skani_files/skani-gtdb-r214-sketch-v0.2.tar.gz. I'll update the tutorials accordingly.
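For anyone scripting the download, a minimal Python sketch might look like the following. The helper names `download` and `extract_tar_gz` and the local paths are hypothetical; only the URL comes from this thread, and the directory layout inside the archive is not assumed. Note that the archive is reported above to be roughly 23GB, and about 51GB once unpacked.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as posted in this thread.
SKETCH_URL = (
    "https://storage.googleapis.com/skani_files/"
    "skani-gtdb-r214-sketch-v0.2.tar.gz"
)


def download(url: str, dest: Path) -> Path:
    """Download url to dest, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_tar_gz(archive: Path, out_dir: Path) -> list:
    """Extract a .tar.gz archive into out_dir and return its member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return names


# Example usage (left commented out because the download is very large):
# archive = download(SKETCH_URL, Path("db/skani-gtdb-r214-sketch-v0.2.tar.gz"))
# members = extract_tar_gz(archive, Path("db"))
# print(len(members), "entries extracted")
```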

bluenote-1577 commented 6 months ago

See https://github.com/bluenote-1577/skani/wiki/Pre%E2%80%90sketched-databases for info. I'll think about adding more databases/settings as things come up...

motroy commented 6 months ago

Thanks @bluenote-1577, that is great! Just a small request: would it be possible to also provide a checksums file for verifying a successful download? Much appreciated 👍
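To illustrate what such a verification could look like, here is a minimal sketch assuming a SHA-256 digest were published alongside the archive. No checksum has been posted in this thread, so the expected value must come from the maintainer; the helper names `sha256_of_file` and `matches_checksum` are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks.

    Chunked reading keeps memory use low for multi-gigabyte archives.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest against a published hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (the expected digest is a placeholder to be filled in):
# ok = matches_checksum("skani-gtdb-r214-sketch-v0.2.tar.gz", "<published digest>")
# print("download OK" if ok else "checksum mismatch, re-download")
```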