dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License
62 stars 7 forks source link

matrix wrong format (?) #27

Closed gaboentropy closed 3 years ago

gaboentropy commented 3 years ago

Hi Dan,

While updating my version of dashing I saw this one. I successfully compiled it with g++-11. However, the matrix is weird, twisted, hard to tell what goes with what:

% ../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list
[Dashing2] Invocation: /Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list Dashing2 made with k = 21, w = -1, DNA target, space = SetSpace, datatype = Fastx and result = FullSetSketch
Nothing written to disk, as no output file provided.
Emitting human-readable: HumanReadable
%
% cat test.dist 
#Dashing2 PHYLIP pairwise Output
#Dashing2Options: Dashing2Options;k:21;parsebyfile;trimchr;sketchsize:16384;sketchtype:fullsetsketch;Fastx;canon;command:"/Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list"
#Sources    FNA/GCF_000005845.fna.gz    FNA/GCF_000006925.fna.gz    FNA/GCF_000009885.fna.gz
3
FNA/GCF_000005845.fna.gz    0.484191895 0.0108032227
FNA/GCF_000006925.fna.gz    0.0113525391
FNA/GCF_000009885.fna.gz

Self with self should be 1 or - (if this is jaccard), but there's no such thing, and, instead, the self with self seems to have results for something else. For example GCF_000005845 with GCF_000005845 has a result of 0.484191895

With dashing these are correctly positioned:

% dashing cmp -k 21 -S 20 -O test.dist -F test.list
Dashing version: v1.draft-3-g90f0
#Path   Size (est.)
FNA/GCF_000009885.fna.gz    5394357
FNA/GCF_000006925.fna.gz    4370118
FNA/GCF_000005845.fna.gz    4544097
%
% cat test.dist 
##Names FNA/GCF_000009885.fna.gz    FNA/GCF_000006925.fna.gz    FNA/GCF_000005845.fna.gz
FNA/GCF_000009885.fna.gz    -   0.010112    0.00974668
FNA/GCF_000006925.fna.gz    -   -   0.479668
FNA/GCF_000005845.fna.gz    -   -   -

There's also an inf result (maybe should be 1) with --mash when genomes are very ... different? but for that the matrix is larger and harder to show here.

I know it's still under development, so I'll be patient. I just wanted you to know these issues before it becomes too complicated.

dnbaker commented 3 years ago

Thank you! That's very helpful.

I'm still verifying details, but I think it can be explained by the order in which the genomes are processed.

Dashing1 sorts the files by size and processes the biggest ones first. For big collections with a few large genomes, that improved runtime. That results in a different ordering of the genomes. You can disable this in Dashing1 with --avoid-sorting. In Dashing2, we process them in order of decreasing size without reordering them. Looking at your two tables, it seems that the results are nearly the same, but ordered differently.

The output format is upper-triangular PHYLIP by default, which might be confusing. To generate a full matrix, use the flag --asymmetric-all-pairs, which I realize wasn't added to the documentation.

I'll let you know when I'm sure.

And one more comment - there's a rather big change between Dashing1 and Dashing2 in the sketch size parameter - now it's <arg> registers, rather than 2^<arg>; the power-of-two size is no longer a requirement, which makes choosing sketch sizes a lot more flexible, but you now have to specify the actual size, e.g., -S1048576 in Dashing2 to match -S20 with dashing1.

gaboentropy commented 3 years ago

Thanks for the answer Dan.

Just to be clear about my problem with the matrix:

If we clean up the file names a bit for clarity:

% grep GCF test.dist | grep -v Dashing2Options | perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g'
#Sources    GCF_000005845   GCF_000006925   GCF_000009885
GCF_000005845   0.484191895 0.0108032227
GCF_000006925   0.0113525391
GCF_000009885

As you can see, if the first row here (starting with a "#") is the header, then that would mean that the 0.48 corresponds to GCF_000005845 compared to itself.

Using dashing 1 with the same genome sequences:

% perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g' test.matriz
##Names GCF_000009885   GCF_000006925   GCF_000005845
GCF_000009885   -   0.0136386   0.0146923
GCF_000006925   -   -   0.475225
GCF_000005845   -   -   -

It would seem like the 0.48 corresponds to GCF_000005845 compared to GCF_000006925 (if we look at the row for GCF_000006925, the third value is the closest to 0.48, and the third label in the header is GCF_000005845).

Best and thanks for your efforts, the software is great.


P.S. The full matrix does solve the header problem:

% ../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.full -F test.list --asymmetric-all-pairs
[Dashing2] Invocation: /Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.full -F test.list --asymmetric-all-pairs Dashing2 made with k = 21, w = -1, DNA target, space = SetSpace, datatype = Fastx and result = FullSetSketch
Nothing written to disk, as no output file provided.
Emitting human-readable: HumanReadable
Batch 0, ranges from 0 to 3
% grep GCF test.full | grep -v Dashing2Options | perl -pe 's{\S+/}{}g ; s{\.fna\S+}{}g;'
#Sources    GCF_000005845   GCF_000006925   GCF_000009885
GCF_000005845   1   0.484191895 0.0108032227
GCF_000006925   0.484191895 1   0.0113525391
GCF_000009885   0.0108032227    0.0113525391    1

Great!

dnbaker commented 3 years ago

Thanks again for your feedback.

It was a poor choice to change the default output format, and I've now reverted to Dashing 1's default output format in the current main branch, which I'll integrate into the next binary release.

To get the old default (PHYLIP), you now have to opt-in with --phylip.

Let me know if you have any further issues or questions!

Best,

Daniel

gaboentropy commented 3 years ago

Thanks!