Closed tsp-kucbd closed 1 year ago
Hi,
Thanks both for the bug report and making it easy to reproduce. We'll get this looked into and patched up in the next day or so.
Best,
Daniel
Hi - I've found the issue.
In the last line for Dashing2, you need to specify the k-mer length and sketch size again. Dashing2 was defaulting to 31, while Dashing1 was using k = 15.
> dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15
#Calling Dashing2 version v2.1.11-4-gd23a with command '/Users/dnb13/Desktop/code/dashing2/dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15'
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources KC139520.fasta MZ375344.fasta KC139520.fasta MZ375344.fasta
KC139520.fasta -0 0.26303053
MZ375344.fasta 0.26303053 -0
(base) Daniels-MacBook-Pro:dashing2 dnb13$ dashing cmp -p 1 -k15 -S16 -F list_of_genomes -Q list_of_genomes -T -O /dev/stdout -o result.labels -M
Dashing version: v1.0-5-g2afa
KC139520.fasta -0 0.252614
MZ375344.fasta 0.252614 -0
So, to make the above work, you would need to modify the last Dashing2 call to make sure it's using the same -S parameter and the same -k parameter.
Could you give that a try?
Thanks,
Daniel
unfortunately it did not work ... When only using "-k 15" in the second Dashing2 call, I get sensible values - but I can see that it created new sketches with the default sketch size 1024.
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 0.26303053
genomeB.fasta 0.26303053 -0
If I use both parameters in the second call "-k 15 -S 65536" I get inf values
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
Hi again,
It's been a long time, but I was finally able to track that down. That was a bug introduced when trying to add M1 support properly.
I've patched it up in https://github.com/dnbaker/dashing2/pull/72, and updated binaries are https://github.com/dnbaker/dashing2-binaries/tree/main/linux/v2.1.14.
Happy to reopen as-needed, but I'm closing for now as it was a lot of work to track down.
Thanks again!
Daniel
Thank you for looking into this. Unfortunately, using the toy example "-k 15 -S 65536" still returns inf values.
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
Testing different sketch sizes, it seems that after 4630 inf values show up
-S4630 works
dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes --mash-distance -k 15 -S 4630
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4630;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 0.26334497
genomeB.fasta 0.26334497 -0
-S4631 does not work anymore
dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes --mash-distance -k 15 -S 4631
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4631;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
It happens both with the binary and the compiled version. /T
Thanks for checking! It seems that for large sketches/small inputs, the default one-permutation setsketch approach is running into empty registers and something is going wrong when accounting for it.
You can add --full-setsketch
to the comparison for now to use a slower sketching method that won't have this issue. I'll think about how to handle this issue in the long-term after.
Best,
Daniel
I am puzzled about the (for us) large dashing1/dashing2 differences for mash distances for some genome pairs. In the example below with dashing1 and mash we get values around 0.25 - which reflects that these genomes are not related.
However, dashing2 reports a much lower distance value of 0.11 which strongly suggests a relationship - which is not there.
Is there a way to reproduce dashing1 results with dashing2?
Below are the steps to reproduce with the genomes from NCBI genomeA https://www.ncbi.nlm.nih.gov/nuccore/KC139520.1?report=fasta
genomeB https://www.ncbi.nlm.nih.gov/nuccore/MZ375344.1?report=fasta