Closed gaboentropy closed 3 years ago
Thank you! That's very helpful.
I'm still verifying details, but I think it can be explained by the order in which the genomes are processed.
Dashing1 sorts the files by size and processes the biggest ones first. For big collections with a few large genomes, that improved runtime. That results in a different ordering of the genomes. You can disable this in Dashing1 with --avoid-sorting
. In Dashing2, we process them in order of decreasing size without reordering them. Looking at your two tables, it seems that the results are nearly the same, but ordered differently.
The output format is upper-triangular PHYLIP by default, which might be confusing. To generate a full matrix, use the flag --asymmetric-all-pairs
, which I realize wasn't added to the documentation.
I'll let you know when I'm sure.
And one more comment - there's a rather big change between Dashing1 and Dashing2 in the sketch size parameter - now it's <arg>
registers, rather than 2^<arg>
; the power-of-two size is no longer a requirement, which makes choosing sketch sizes a lot more flexible, but you now have to specify the actual size, e.g., -S1048576
in Dashing2 to match -S20
with dashing1.
Thanks for the answer Dan.
Just to be clear about my problem with the matrix:
If we clean up the file names a bit for clarity:
% grep GCF test.dist | grep -v Dashing2Options | perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g'
#Sources GCF_000005845 GCF_000006925 GCF_000009885
GCF_000005845 0.484191895 0.0108032227
GCF_000006925 0.0113525391
GCF_000009885
As you can see, if the first row here (starting with a "#") is the header, then that would mean that the 0.48 corresponds to GCF_000005845 compared to itself.
Using dashing 1 with the same genome sequences:
% perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g' test.matriz
##Names GCF_000009885 GCF_000006925 GCF_000005845
GCF_000009885 - 0.0136386 0.0146923
GCF_000006925 - - 0.475225
GCF_000005845 - - -
It would seem like the 0.48 corresponds to GCF_000005845 compared to GCF_000006925 (if we look at the row for GCF_000006925, the third value is the closest to 0.48, and the third label in the header is GCF_000005845).
Best and thanks for your efforts, the software is great.
P.S. The full matrix does solve the header problem:
% ../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.full -F test.list --asymmetric-all-pairs
[Dashing2] Invocation: /Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.full -F test.list --asymmetric-all-pairs Dashing2 made with k = 21, w = -1, DNA target, space = SetSpace, datatype = Fastx and result = FullSetSketch
Nothing written to disk, as no output file provided.
Emitting human-readable: HumanReadable
Batch 0, ranges from 0 to 3
% grep GCF test.full | grep -v Dashing2Options | perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g;'
#Sources GCF_000005845 GCF_000006925 GCF_000009885
GCF_000005845 1 0.484191895 0.0108032227
GCF_000006925 0.484191895 1 0.0113525391
GCF_000009885 0.0108032227 0.0113525391 1
Great!
Thanks again for your feedback.
It was a poor choice to change the default output format, and I've now reverted to Dashing 1's default output format in the current main branch, which I'll integrate into the next binary release.
To get the old default (PHYLIP), you now have to opt-in with --phylip
.
Let me know if you have any further issues or questions!
Best,
Daniel
Thanks!
Hi Dan,
While updating my version of dashing I saw this one. I successfully compiled it with g++-11. However, the matrix is weird, twisted, hard to tell what goes with what:
Self with self should be 1 or - (if this is jaccard), but there's no such thing, and, instead, the self with self seems to have results for something else. For example GCF_000005845 with GCF_000005845 has a result of 0.484191895
With dashing these are correctly positioned:
There's also an inf result (maybe should be 1) with --mash when genomes are very ... different? but for that the matrix is larger and harder to show here.
I know it's still under development, so I'll be patient. I just wanted you to know these issues before it becomes too complicated.