dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

reading binary output from cmp #63

Open tsp-kucbd opened 3 years ago

tsp-kucbd commented 3 years ago

Is there a way of reading the binary distance matrix from "dashing cmp -b" directly into numpy?

dnbaker commented 3 years ago

Sure! It's really easy with Numpy. Here's a script that does what you need:

import numpy as np

def parse_binary(path):
    dat = np.fromfile(path, dtype=np.uint8)
    assert dat[0] == 0, "Expected the first byte to be zero"
    n = int(dat[1:9].view(np.uint64)[0])
    vals = dat[9:].view(np.float32)
    return n, vals

def tri2full(n, vals, is_sim=True):
    mat = np.zeros((n,n), dtype=np.float32)
    mat[np.triu_indices(n, k=1)] = vals
    mat[np.tril_indices(n, k=-1)] = vals
    mat[np.diag_indices(n)] = 1. if is_sim else 0.
    return mat

if __name__ == "__main__":
    import sys
    if sys.argv[1:]:
        n, v = parse_binary(sys.argv[1])
        print(n, v)
        full = tri2full(n, v)
        print(full)

The first byte describes the type of the distance matrix. The next 8 are a 64-bit integer for the number of items in the distance matrix, and then the remaining items are float32. You can then use the np.triu_indices function to export items from packed space into the full matrix. In the above example, I set both halves of the matrix.

I'll add some instructions regarding this into the documentation soon.

Thanks!

Daniel

tsp-kucbd commented 3 years ago

Thank you, that worked! Although only for smaller runs where I have 10,000 genomes. The current run where I have 400,000 genomes works nicely with "-T" writing the csv files but crashes directly (after 3 minutes) in the beginning when trying to use the binary output "-b"

./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -b -o labelsB --use-nthash -O  NGOT.dashdists.bin

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    40318 abort (core dumped)  ./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB  -O

I will open a new issue with this, /T