dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

dashing2 for metagenome #71

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Daniel,

I am comparing dashing2 with bindash and Mash for metagenomes. I am aware that Mash uses canonical k-mers. Metagenomic reads are almost always paired-end, so each pair can be merged into a single read by overlap detection. We would then only need to process half as many reads, and the result should be the same because canonical k-mers are used. I did not see a suggestion from Mash or dashing to merge first (which is very fast); doing so would cut computation time roughly in half without changing the results at all. What do you think?
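To make the canonical k-mer argument concrete, here is a minimal Python sketch. It is a plain model of canonical k-mer sets, not code from Mash, bindash, or dashing2; the helper names, the toy fragment, and k = 5 are all made up for illustration. It assumes the two reads overlap by at least k - 1 bases, in which case the merged read yields the same k-mer set as the pair. With a shorter overlap, the merged read would contain a few k-mers spanning the junction that neither read has on its own, and any multiplicity-aware (weighted) sketch would see different counts in the overlap.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Set of canonical k-mers: the lexicographically smaller of each
    k-mer and its reverse complement. K-mers containing N are skipped."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


k = 5
fragment = "ACGTTGCAAGGCTTAC"                 # hypothetical 16-base insert
read1 = fragment[:10]                         # forward read
read2 = reverse_complement(fragment[6:])      # reverse read, overlap of 4 = k - 1
merged = fragment                             # what an overlap-based merger would output

pair_set = canonical_kmers(read1, k) | canonical_kmers(read2, k)
merged_set = canonical_kmers(merged, k)
print(pair_set == merged_set)
```

The strand of read2 does not matter here because each k-mer and its reverse complement map to the same canonical form.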

Thanks,

Jianshu

dnbaker commented 1 year ago

Hi Jianshu -

Interesting. Yes, you can collapse them together. A lot depends on whether the two ends overlap with each other. You can safely concatenate the sequences with an N between them - Dashing and Dashing2 will mask any k-mers containing unknown bases, so you'll end up with one k-mer set for the paired-end reads.

Normally you could just concatenate the files directly since they end up in the same bucket. But you are right, any preprocessing can make things smaller.
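As a rough illustration of the N trick, the Python sketch below models k-mer masking in the simplest way: any window containing an N is skipped. This is an assumption about the behavior described above, not Dashing2's actual code, and the reads and helper names are hypothetical. Joining with an N gives exactly the union of the two reads' k-mer sets, while joining without a separator can add spurious k-mers that span the junction.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Canonical k-mer set; windows containing N are masked (skipped)."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


k = 5
read1 = "ACGTTGCAAG"   # hypothetical forward read
read2 = "GTAAGCCTTG"   # hypothetical reverse read

separate = canonical_kmers(read1, k) | canonical_kmers(read2, k)
joined_with_n = canonical_kmers(read1 + "N" + read2, k)
joined_plain = canonical_kmers(read1 + read2, k)

print(joined_with_n == separate)   # every window across the junction contains N
print(joined_plain >= separate)    # plain joining keeps all real k-mers, may add junction k-mers
```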

And to check in about DartMinHash - I've worked on incorporating its weighted minhashing scheme, but I haven't had time to test accuracy results and merge it in. If it helps us with weighted sketching, it would really cut the cost of --bagminhash sketching.

Sorry for the delay!

Thanks,

Daniel