dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License
62 stars 7 forks source link

Differences in mash distances between dashing2 vs dashing1/mash #63

Closed tsp-kucbd closed 1 year ago

tsp-kucbd commented 2 years ago

I am puzzled about the (for us) large dashing1/dashing2 differences for mash distances for some genome pairs. In the example below with dashing1 and mash we get values around 0.25 - which reflects that these genomes are not related.

However, dashing2 reports a much lower distance value of 0.11 which strongly suggests a relationship - which is not there.

Is there a way to reproduce dashing1 results with dashing2?

Below are the steps to reproduce with the genomes from NCBI genomeA https://www.ncbi.nlm.nih.gov/nuccore/KC139520.1?report=fasta

genomeB https://www.ncbi.nlm.nih.gov/nuccore/MZ375344.1?report=fasta

dashing sketch -k 15 -S16 -p 1 -F list_of_genomes
dashing cmp -p 1  -M -W -k15 -S16 -F list_of_genomes -Q list_of_genomes  -T -O result.tab -o result.labels

cat result.tab
genomeA.fasta   -0      0.252614
genomeB.fasta   0.252614        -0

dashing2 sketch -p 1 -k 15 -S 65536 -F list_of_genomes

dashing2 cmp -p 1 --cmpout result_D2.out  -F list_of_genomes -Q list_of_genomes --mash-distance

cat result_D2.out
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:32;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      0.115187205
genomeB.fasta   0.115187205     -0
dnbaker commented 2 years ago

Hi,

Thanks both for the bug report and making it easy to reproduce. We'll get this looked into and patched up in the next day or so.

Best,

Daniel

dnbaker commented 2 years ago

Hi - I've found the issue.

In the last line for Dashing2, you need to specify the k-mer length and sketch size again. Dashing2 was defaulting to 31, while Dashing1 was using k = 15.

> dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15
#Calling Dashing2 version v2.1.11-4-gd23a with command '/Users/dnb13/Desktop/code/dashing2/dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15'
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources    KC139520.fasta  MZ375344.fasta  KC139520.fasta  MZ375344.fasta
KC139520.fasta  -0  0.26303053
MZ375344.fasta  0.26303053  -0
(base) Daniels-MacBook-Pro:dashing2 dnb13$ dashing cmp -p 1 -k15 -S16 -F list_of_genomes -Q list_of_genomes  -T -O /dev/stdout -o result.labels -M
Dashing version: v1.0-5-g2afa
KC139520.fasta  -0  0.252614
MZ375344.fasta  0.252614    -0

So, to make the above work, you would need to modify the last Dashing2 call to make sure it's using the same -S parameter and the same -k parameter.

Could you give that a try?

Thanks,

Daniel

tsp-kucbd commented 2 years ago

unfortunately it did not work ... When only using "-k 15" in the second Dashing2 call, I get sensible values - but I can see that it created new sketches with the default sketch size 1024.

#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      0.26303053
genomeB.fasta   0.26303053      -0

If I use both parameters in the second call "-k 15 -S 65536" I get inf values

#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      inf
genomeB.fasta   inf     -0
dnbaker commented 1 year ago

Hi again,

It's been a long time, but I was finally able to track that down. That was a bug introduced when trying to add M1 support properly.

I've patched it up in https://github.com/dnbaker/dashing2/pull/72, and updated binaries are https://github.com/dnbaker/dashing2-binaries/tree/main/linux/v2.1.14.

Happy to reopen as-needed, but I'm closing for now as it was a lot of work to track down.

Thanks again!

Daniel

tsp-kucbd commented 1 year ago

Thank you for looking into this. Unfortunately, using the toy example "-k 15 -S 65536" still returns inf values.

#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      inf
genomeB.fasta   inf     -0

Testing different sketch sizes, it seems that after 4630 inf values show up

-S4630 works

dashing2 cmp -p 1   -F list_of_genomes -Q list_of_genomes --mash-distance  -k 15  -S 4630
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4630;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      0.26334497
genomeB.fasta   0.26334497      -0

-S4631 does not work anymore

dashing2 cmp -p 1   -F list_of_genomes -Q list_of_genomes --mash-distance  -k 15  -S 4631
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4631;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      inf
genomeB.fasta   inf     -0

It happens both with the binary and the compiled version. /T

dnbaker commented 1 year ago

Thanks for checking! It seems that for large sketches/small inputs, the default one-permutation setsketch approach is running into empty registers and something is going wrong when accounting for it.

You can add --full-setsketch to the comparison for now to use a slower sketching method that won't have this issue. I'll think about how to handle this issue in the long-term after.

Best,

Daniel