Closed oschwengers closed 4 years ago
Sure!
Let's say you have a list of reference genomes in fnames.txt and a list of query fastas/assemblies in qnames.txt, and that you set kmerlen and numthreads as desired. First, you'd sketch the references by:
dashing sketch -F fnames.txt -k $kmerlen -p $numthreads # Optional: spacing pattern for the seed
This then caches a sketch for each file. Then, you get a list of distances by:
dashing dist -k $kmerlen -p $numthreads -F fnames.txt -Q qnames.txt -osizes.txt -Odistmat.tsv
If you pass -C
as well, then dashing will cache sketches for the query genomes. You can change which directory the cached sketches are stored in with the -P/--prefix
option.
Does that help?
Thanks for asking, good luck, and feel free to ask again!
Hi @dnbaker , thanks a lot for the rapid reply! I tried your example. Maybe I'm missing something, but unfortunately, all dist values in the distmat.tsv are zeros only. In my example I test an Pseudomonas aeruginosa genome against all reference, representative and complete bacterial genomes from RefSeq. So I expect to find some close relatives and many unrelated hits... Surely, I'm doing something wrong but I just don't know what.
Also, it would be nice if hits could be filtered by the max-dist (and e-value) as it is possible in Mash
.
Once again, thanks a lot for your help. very much appreciated!
You might want to check that it's doing what you want --
Jaccard index
, not mash distance. To get mash distance, use -M
or --mash
.We don't currently filter for top hits, and we don't provide an e-value or other statistical test, but we could certainly add such a mode or a mode for maintaining nearest neighbors. I've opened a feature request issue here, and while the next couple of weeks are quite packed, we'll prioritize that as an important feature.
Thanks!
Well, I don't have to use the Mash-Distanz
. Jaccard index
would also be fine.
Anyways thanks for the fast reply and your help. I'm looking forward to the new features (#33)!
FYI: I tested the same with k=21
and k=27
and -M
, now, all distances are exactly 1. Maybe it would be nice to give some guidance for common scenarios like e.g. the lookup of related bacterial genomes in dbs.
Thanks again!
Hi, thanks for this fast alternative! I'd like to test dashing as a replacement for
Mash
in an inhouse-pipeline and maybe to replaceKraken
in https://github.com/oschwengers/asap.I'm a little bit puzzled by the different subcommands and parameters. What is the best way to:
fasta
files (bacterial genomes, 1 genome per file)Could you comment on this and maybe provide a short list of commands? Thanks a lot and best regards!