dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Binary format differs from expectation #73

Closed arivers closed 3 years ago

arivers commented 3 years ago

I used Dashing v0.5.6 s128 on a Linux machine to compare pre-hashed genomes. The command was:

./dashing_s128 cmp -p78 --presketched  -b -Ofull_dashing_S16_k31_dist.bin -F fullpath_hll_filelist.txt -Q fullpath_hll_filelist.txt

From the specification here, I was expecting a half-matrix output: 1 byte specifying full or half matrix, 8 bytes specifying the length in np.float64, and (n(n-1)/2) * 4 bytes of data in np.float32. Note that supplying only -Q for the file path did not work.

Instead, I get a file of exactly (n^2) * 4 bytes, so I'm assuming I just got a square matrix of 4-byte float32 values.

The file is 422,393,406,724 bytes for n = 324,959.
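As a quick check of that arithmetic, here is a minimal sketch comparing the two candidate layouts for this n. The half-matrix header sizes (1 flag byte plus 8 length bytes) are taken from the reading of the specification described above and are an assumption, not a confirmed description of the format.

```python
# Expected output sizes for n sketches under the two layouts discussed above.
n = 324_959
FLOAT32_BYTES = 4

# Headerless square matrix: n * n float32 values.
full_bytes = n * n * FLOAT32_BYTES

# Packed half matrix (assumed layout): 1 flag byte + 8 length bytes
# + n(n-1)/2 float32 values.
half_bytes = 1 + 8 + (n * (n - 1) // 2) * FLOAT32_BYTES

print(full_bytes)  # 422393406724, matching the observed file size
print(half_bytes)
```

Only the square layout reproduces the observed size, which is consistent with a full n x n float32 matrix and no header.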

I can import the data as a NumPy memory map like this:

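import numpy as np
# Open read-only (np.memmap defaults to 'r+') so the matrix file cannot be modified by accident.
val = np.memmap('full_dashing_S16_k31_dist.bin', dtype=np.float32, mode='r', shape=(324959, 324959))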

I just wanted to know if this import was correct and also make you aware that the output was not what I expected. I saw in the previous issues you are working on documenting the binary format so I thought I'd pass this along. Overall, Dashing is fantastic and I really appreciate your team's hard work.

dnbaker commented 3 years ago

Hi,

You're rather close, though the distances are float32, not float64. Unlike the packed upper-triangular distance output, the asymmetric comparison output has no bookkeeping in the file. I realize now that the usage for asymmetric comparisons (the -Q option) isn't sufficiently clear, and I'll try to improve it.

The -Q option is for asymmetric comparisons. Default cmp comparisons produce upper-triangular distances (if -F or positional-only arguments are provided), but if -Q is enabled, then the output shape is (|F|, |Q|).
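A minimal sketch of loading such a headerless (|F|, |Q|) output with NumPy follows. The helper names `count_paths` and `load_asymmetric` are hypothetical, and the sketch assumes row-major float32 values in native byte order and one sketch path per line in each file list; none of this is taken from Dashing's own code.

```python
import os
import numpy as np

def count_paths(list_file):
    """Count non-empty lines in a file list (assumed: one sketch path per line)."""
    with open(list_file) as fh:
        return sum(1 for line in fh if line.strip())

def load_asymmetric(matrix_path, n_f, n_q):
    """Memory-map a headerless float32 matrix of shape (n_f, n_q), read-only.

    Raises ValueError if the file size does not equal n_f * n_q * 4 bytes,
    which would indicate a header or a different layout.
    """
    expected = n_f * n_q * np.dtype(np.float32).itemsize
    actual = os.path.getsize(matrix_path)
    if actual != expected:
        raise ValueError(f"expected {expected} bytes, found {actual}")
    return np.memmap(matrix_path, dtype=np.float32, mode="r", shape=(n_f, n_q))
```

For the command in this issue, both counts would come from `count_paths("fullpath_hll_filelist.txt")`, and the matrix path would be `full_dashing_S16_k31_dist.bin`. The size check is there so that a packed upper-triangular file is rejected rather than silently misread.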

You would typically only want to use both -Q and -F for asymmetric distances like containment, where f(x_i, x_j) != f(x_j, x_i), since otherwise you could just compute the upper-triangular portion of the matrix (which is the behavior from providing only -F and not -Q).
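To illustrate the asymmetry, here is a toy sketch using exact Python sets of k-mers rather than HyperLogLog sketches. It uses the common definition of containment, |A ∩ B| / |A|, which is an assumption about what is meant here and not Dashing's implementation; the k-mer sets are made up for the example.

```python
def jaccard(a, b):
    """Symmetric similarity: shared elements over all elements."""
    return len(a & b) / len(a | b)

def containment(a, b):
    """Asymmetric: fraction of a's elements that also occur in b."""
    return len(a & b) / len(a)

# Hypothetical k-mer sets; x is a strict subset of y.
x = {"ACG", "CGT", "GTA"}
y = {"ACG", "CGT", "GTA", "TAC", "ACC", "CCA"}

print(jaccard(x, y), jaccard(y, x))          # 0.5 0.5
print(containment(x, y), containment(y, x))  # 1.0 0.5
```

Because swapping the arguments changes the containment value, both halves of the matrix carry information, whereas a symmetric measure like Jaccard needs only the upper triangle.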

Does this help?

Thanks!

Daniel

arivers commented 3 years ago

Okay thanks!

I knew the actual data was in np.float32; I updated my comment to make that clear.

I ended up using -Q and -F because of this line in the README: "To generate a full, asymmetric distance matrix, provide the same path to -F and -Q."

I tried it both ways. When I ran with --presketched and -F but no -Q, I got this error:

$ dashing_s128 cmp -p1 --presketched  -b -Otest1.bin -F sample.txt
Dashing version: v0.5.6
#Path   Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll  4727565
...
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1: Invalid argument
Aborted (core dumped)

Running with --presketched, -Q and -F works:

$ dashing_s128 cmp -p1 --presketched  -b -Otest1.bin -F sample.txt -Q sample.txt
Dashing version: v0.5.6
#Path   Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll  4727565
...

dnbaker commented 3 years ago

I see. That makes sense. In fact, while investigating the problem today, I ran into the same error (Unknown error -1), fixed it, and incorporated the fix into a new release, which just finished building. Want to give it a try?

arivers commented 3 years ago

Yes, your new release, v0.5.7, fixed the issue. Thanks.