dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License
62 stars 7 forks source link

no dashing2 cmp help info #51

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Hello Daniel,

I compute sketches using dashing2 sketch and store sketch using - - outfile. There is a sketch and a sketch.name.txt. But I am not sure how to feed those to dashing2 cmp since no help is provided. I looked into the code and it confuses me. I can use -cmpout to have the distance but I want to check how long sketching and cmp take.

Thanks,

Jianshu

jianshu93 commented 2 years ago

Hello Daniel,

time dashing2 sketch -k 11 -S 12000 --threads 24 --pminhash --topk 250 --cmpout phage_GPD_topK_250.txt -Q name.txt -F reference.txt

With and without the --topk 250 option, I have exactly the same output. Am I making a mistake? I am using the newest v2.1.11 for 512bw. Forgive me the sketch command read me is a little bit long/confusing.

Thanks,

Jianshu

dnbaker commented 2 years ago

Hi Jianshu,

Sorry for the wait! It's been a busy couple of weeks.

I've added in cmp usage (https://github.com/dnbaker/dashing2/commit/3b71c9cdb925aa582921e89a7cc66f62f773f9d4), so thank you for pointing that out.

You can pass sketched files to cmp, but you have to sketch all the original input files together. For example, something like this:

dashing2 sketch  <sketching options> -F filelist.txt -o stacked_sketch_file.
dashing2 cmp <sketching options> --presketched stacked_sketch_file.rc_canon.sketchsize1024.k32.SetSpace.DNA.opss 

This way, it's broken into two stages (which you can time). This also makes it easier to work with larger collections, since the sketch matrix can be memory-mapped and therefore exceed system RAM.

Does that help?

I'll look into the topk 250 option results as well. I'm not sure what's going on there, but I'll let you know.

Thanks,

Daniel