Closed (arivers closed this issue 3 years ago)
Hi,
You're rather close, though the distances are float32, not float64, and unlike the packed upper-triangular distance matrix, the asymmetric comparison has no bookkeeping in the file. I realize now that usage for asymmetric comparisons (the -Q option) isn't sufficiently clear, and I'll try to improve it.
The -Q option is for asymmetric comparisons. Default cmp comparisons produce upper-triangular distances (if -F or positional-only arguments are provided), but if -Q is enabled, then the output shape is (|F|, |Q|).
You would typically only want to use both -Q and -F for asymmetric distances like containment, where f(x_i, x_j) != f(x_j, x_i), since otherwise you could just compute the upper-triangular portion of the matrix (which is the behavior from providing only -F and not -Q).
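As a rough illustration of reading such a headerless (|F|, |Q|) output, here is a minimal NumPy sketch. The function name open_asymmetric_matrix is hypothetical, and the row-major layout is an assumption that is not confirmed anywhere in this thread; only the float32 element type and the absence of a header come from the description above.

```python
import os
import tempfile

import numpy as np


def open_asymmetric_matrix(path, n_f, n_q):
    """Memory-map a headerless float32 file as an (n_f, n_q) matrix.

    Assumption: entries are stored row-major (C order). This is a guess,
    not something the thread confirms.
    """
    expected = n_f * n_q * 4  # 4 bytes per float32 entry, no header
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"expected {expected} bytes, found {actual}")
    return np.memmap(path, dtype=np.float32, mode="r", shape=(n_f, n_q))


# Demonstration on a small synthetic file (not real Dashing output).
demo = np.arange(6, dtype=np.float32).reshape(2, 3)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    demo.tofile(demo_path)  # raw float32 values, no header
    mat = open_asymmetric_matrix(demo_path, 2, 3)
    loaded = np.array(mat)  # copy into memory
    del mat  # release the mapping before the temp dir is removed
```

The size check is there because a mismatch between the file size and n_f * n_q * 4 is the quickest sign that the file is in a different layout, such as the packed upper-triangular one.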
Does this help?
Thanks!
Daniel
Okay thanks!
I knew the actual data was in np.float32; I updated my comment to make that clear.
I ended up using Q and F because of this line in the README. "To generate a full, asymmetric distance matrix, provide the same path to -F and -Q."
I tried it both ways, and when I ran with --presketched and -F with no -Q, I got this error:
$ dashing_s128 cmp -p1 --presketched -b -Otest1.bin -F sample.txt
Dashing version: v0.5.6
#Path Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll 4727565
...
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1: Invalid argument
Aborted (core dumped)
Running with --presketched, -Q and -F works:
$ dashing_s128 cmp -p1 --presketched -b -Otest1.bin -F sample.txt -Q sample.txt
Dashing version: v0.5.6
#Path Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll 4727565
...
I see. That makes sense -- in fact, while investigating the problem today, I ran into the same error (Unknown error -1), fixed it, and incorporated the fix into a new release, which just finished building. Want to give it a try?
Yes, your new release, v0.5.7, fixed the issue. Thanks.
I used Dashing v0.5.6 s128 on a Linux machine to compare pre-hashed genomes. The command was:
From the specification here, I was expecting a half-matrix output with 1 byte specifying full or half matrix, 8 bytes specifying the length in np.float64, and (n(n-1)/2) * 4 bytes of data in np.float32. Note that supplying -Q only for the file path did not work. Instead, I get a file of exactly (n^2) * 4 bytes, so I'm assuming I just got a square matrix of 4-byte float32 values.
The file is 422,393,406,724 bytes for n = 324,959.
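The size arithmetic above can be checked with a short sketch. The 1-byte flag plus 8-byte length header is taken from the poster's description of the expected half-matrix format, not verified against the specification, and the function names are illustrative.

```python
def packed_upper_triangular_bytes(n):
    """Expected size of the half-matrix format as described in the post:
    1 flag byte + 8-byte length + n(n-1)/2 float32 entries."""
    return 1 + 8 + (n * (n - 1) // 2) * 4


def square_bytes(n):
    """Size of a headerless n x n matrix of float32 values."""
    return n * n * 4


n = 324_959
observed = 422_393_406_724  # file size reported in the post

print(square_bytes(n) == observed)        # True
print(packed_upper_triangular_bytes(n))   # 211196053453
```

The observed size matches the square layout exactly, and is roughly twice what the packed half-matrix layout would occupy.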
I can import the data as a NumPy memory map like this:
I just wanted to know if this import was correct and also make you aware that the output was not what I expected. I saw in the previous issues you are working on documenting the binary format so I thought I'd pass this along. Overall, Dashing is fantastic and I really appreciate your team's hard work.