Closed jolespin closed 6 months ago
Hi @jolespin,
I could consider it. The issue is that skani's sketched databases are quite big; I forget the size of skani's GTDB-R214 sketch but it's much larger than sylph's. I'll leave this up and if a lot of other people request it, I'll put a copy up.
Have you looked into any ANI thresholds that could be used to suggest novel species, genus, family, etc? For example, < 95% might be new species within the genus? Or is this too speculative here.
Skani works above ~80% ANI. I don't have a reference right now, but 80-95% ANI should always be genus-level similarity, I believe. So yeah, you could use it to do genus-level annotation too. However, two genomes from the same genus can have < 80% ANI, so it won't be perfect.
I was thinking about setting a low threshold like 50% and then running some tests to see if I could set some thresholds for different taxonomic levels, but if skani won't work with ANI < 80% then I might need to focus only on genus and species level.
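For anyone wanting to experiment with this, here is a minimal Python sketch of the cutoffs discussed above. The 95% species cutoff and the ~80% floor come from this thread; the function name, the labels, and the choice to treat everything below 80% as unresolved are my own assumptions, not anything skani provides.

```python
# Rough cutoffs taken from the discussion above; they are heuristics, not skani settings.
SPECIES_ANI = 95.0  # at or above this, the best hit is treated as the same species
GENUS_FLOOR = 80.0  # skani is said to work above roughly this ANI


def suggest_rank(ani_percent):
    """Map a best-hit ANI (in percent) to a rough taxonomic suggestion."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError(f"ANI must be between 0 and 100, got {ani_percent}")
    if ani_percent >= SPECIES_ANI:
        return "same species"
    if ani_percent >= GENUS_FLOOR:
        return "same genus, possibly novel species"
    # Below the floor the ANI estimate is not reliable, so make no call.
    return "unresolved"
```

Note that "unresolved" does not mean "different genus": as mentioned above, two genomes from the same genus can still fall below 80% ANI.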
Hi @bluenote-1577,
thank you for this excellent tool, great work! I just set up the GTDB-R214 sketch database, so regarding your comment:
I could consider it. The issue is that skani's sketched databases are quite big; I forget the size of skani's GTDB-R214 sketch but it's much larger than sylph's. I'll leave this up and if a lot of other people request it, I'll put a copy up.
the GTDB-R214 sketch database directory is ~51GB (~23GB as tar.gz or ~19GB as tar.xz). It would be great if the sketch database versions were made available for download.
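In case it helps anyone reproduce the comparison, this is roughly how a sketch directory could be packed with Python's standard `tarfile` module. It is only a sketch: `pack_sketch_dir` and the example paths are hypothetical names, and the actual archives may well have been made with the `tar` command instead.

```python
import tarfile
from pathlib import Path


def pack_sketch_dir(sketch_dir, archive_path, compression="xz"):
    """Pack a directory into a compressed tarball and return the archive size in bytes.

    `compression` is passed to tarfile as the mode suffix: "gz", "xz" or "bz2".
    """
    sketch_dir = Path(sketch_dir)
    with tarfile.open(archive_path, f"w:{compression}") as tar:
        # Store entries under the directory's own name so it extracts into one folder.
        tar.add(sketch_dir, arcname=sketch_dir.name)
    return Path(archive_path).stat().st_size


# Example (hypothetical paths):
# size = pack_sketch_dir("gtdb_skani_database", "gtdb_skani_database.tar.xz")
```

Going by the sizes reported above, the tar.gz is about 45% of the uncompressed directory (23/51) and the tar.xz about 37% (19/51).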
Hi all,
Thanks for the feedback. I didn't realize the compression would be this good. I have now uploaded the R214 sketch as a tar.gz file at https://storage.googleapis.com/skani_files/skani-gtdb-r214-sketch-v0.2.tar.gz. I'll update the tutorials accordingly.
See https://github.com/bluenote-1577/skani/wiki/Pre%E2%80%90sketched-databases for info. I'll think about adding more databases/settings as things come up...
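For scripted setups, a minimal Python sketch of fetching and unpacking the tarball might look like the following. The URL is the one posted above; `fetch_and_extract` and the destination directory are hypothetical names, and I'm not assuming anything about the folder layout inside the archive.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

SKETCH_URL = "https://storage.googleapis.com/skani_files/skani-gtdb-r214-sketch-v0.2.tar.gz"


def fetch_and_extract(url, dest_dir):
    """Download a tarball into dest_dir, unpack it there, and return dest_dir as a Path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / Path(urlparse(url).path).name

    # Stream to disk in chunks; the archive is far too large to hold in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    with tarfile.open(archive, "r:*") as tar:
        # Use the safer "data" extraction filter where this Python version supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return dest_dir


# Example (downloads the full archive, so make sure there is enough disk space):
# fetch_and_extract(SKETCH_URL, "skani_db")
```

Keep in mind that the archive and the extracted database both sit on disk until the archive is deleted, so the sizes quoted earlier in the thread add up.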
Thanks @bluenote-1577, that is great! Just a small request: is it possible to also provide a checksum file for verification of a successful download? Much appreciated 👍
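To illustrate what I mean, this is a minimal Python sketch of the check that a published checksum would allow. SHA-256 is my assumption here (no checksum file or algorithm has been announced), and the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA-256 digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example (the expected value would come from the published checksum file):
# matches_checksum("skani-gtdb-r214-sketch-v0.2.tar.gz", expected_hex)
```

Reading in chunks matters here because the archive is tens of gigabytes.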
I'm following along on the tutorial here: https://github.com/bluenote-1577/skani/wiki/Tutorial:-setting-up-the-GTDB-genome-database-to-search-against
I noticed that your other tool, sylph, has a few databases built here: https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases
Would it be possible to host the skani versions of these datasets as well? In particular, the GTDB r214.1.
Regarding your note here:
Have you looked into any ANI thresholds that could be used to suggest novel species, genus, family, etc? For example, < 95% might be new species within the genus? Or is this too speculative here.