dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

Standard Mash-like DB lookup #32

Closed oschwengers closed 4 years ago

oschwengers commented 4 years ago

Hi, thanks for this fast alternative! I'd like to test dashing as a replacement for Mash in an inhouse-pipeline and maybe to replace Kraken in https://github.com/oschwengers/asap.

I'm a little bit puzzled by the different subcommands and parameters. What is the best way to:

  1. create a sketch database from a set of fasta files (bacterial genomes, 1 genome per file)
  2. compute and get a list of the closest genomes from this db against a query genome (preferably fasta or pre-sketched) fulfilling a certain dist threshold.

Could you comment on this and maybe provide a short list of commands? Thanks a lot and best regards!

dnbaker commented 4 years ago

Sure!

Let's say you have a list of reference genomes in fnames.txt and a list of query fastas/assemblies in qnames.txt, and that you set kmerlen and numthreads as desired. First, you'd sketch the references by:

dashing sketch -F fnames.txt -k $kmerlen -p $numthreads # Optional: spacing pattern for the seed

This then caches a sketch for each file. Then, you get a list of distances by:

dashing dist -k $kmerlen -p $numthreads -F fnames.txt -Q qnames.txt -osizes.txt -Odistmat.tsv

If you pass -C as well, then dashing will cache sketches for the query genomes. You can change which directory the cached sketches are stored in with the -P/--prefix option.

Does that help?

Thanks for asking, good luck, and feel free to ask again!

oschwengers commented 4 years ago

Hi @dnbaker , thanks a lot for the rapid reply! I tried your example. Maybe I'm missing something, but unfortunately, all dist values in the distmat.tsv are zeros only. In my example I test an Pseudomonas aeruginosa genome against all reference, representative and complete bacterial genomes from RefSeq. So I expect to find some close relatives and many unrelated hits... Surely, I'm doing something wrong but I just don't know what.

Also, it would be nice if hits could be filtered by the max-dist (and e-value) as it is possible in Mash. Once again, thanks a lot for your help. very much appreciated!

dnbaker commented 4 years ago

You might want to check that it's doing what you want --

  1. The default emitted quantity is Jaccard index, not mash distance. To get mash distance, use -M or --mash.
  2. We default to 31 for a kmer length, which is pretty strict. I wonder if that might be the issue; I'd recommend relaxing it to 21 or lower. This depends on how stringent you want to be. (Mash's default is 21, for instance.) If you're looking for broader phylogenetic matches, I'd try 17 or even 14. (You can see this change by trying a few different values.)

We don't currently filter for top hits, and we don't provide an e-value or other statistical test, but we could certainly add such a mode or a mode for maintaining nearest neighbors. I've opened a feature request issue here, and while the next couple of weeks are quite packed, we'll prioritize that as an important feature.

Thanks!

oschwengers commented 4 years ago

Well, I don't have to use the Mash-Distanz. Jaccard index would also be fine. Anyways thanks for the fast reply and your help. I'm looking forward to the new features (#33)!

FYI: I tested the same with k=21 and k=27 and -M, now, all distances are exactly 1. Maybe it would be nice to give some guidance for common scenarios like e.g. the lookup of related bacterial genomes in dbs. Thanks again!