Open tsp-kucbd opened 3 years ago
Sure! It's really easy with Numpy. Here's a script that does what you need:
import numpy as np
def parse_binary(path):
dat = np.fromfile(path, dtype=np.uint8)
assert dat[0] == 0, "Expected the first byte to be zero"
n = int(dat[1:9].view(np.uint64)[0])
vals = dat[9:].view(np.float32)
return n, vals
def tri2full(n, vals, is_sim=True):
mat = np.zeros((n,n), dtype=np.float32)
mat[np.triu_indices(n, k=1)] = vals
mat[np.tril_indices(n, k=-1)] = vals
mat[np.diag_indices(n)] = 1. if is_sim else 0.
return mat
if __name__ == "__main__":
import sys
if sys.argv[1:]:
n, v = parse_binary(sys.argv[1])
print(n, v)
full = tri2full(n, v)
print(full)
The first byte describes the type of the distance matrix. The next 8 are a 64-bit integer for the number of items in the distance matrix, and then the remaining items are float32. You can then use the np.triu_indices
function to export items from packed space into the full matrix. In the above example, I set both halves of the matrix.
I'll add some instructions regarding this into the documentation soon.
Thanks!
Daniel
Thank you, that worked! Although only for smaller runs where I have 10,000 genomes. The current run where I have 400,000 genomes works nicely with "-T" writing the csv files but crashes directly (after 3 minutes) in the beginning when trying to use the binary output "-b"
./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB --use-nthash -O NGOT.dashdists.bin
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[1] 40318 abort (core dumped) ./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB -O
I will open a new issue with this, /T
Is there a way of reading the binary distance matrix from "dashing cmp -b" directly into numpy?