dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

The subtraction function of Dashing #80

Open liaoherui opened 3 years ago

liaoherui commented 3 years ago

Firstly, thanks a lot for your excellent tool! It's really cool!

This is my question:

Currently, I have three files, A.fq, B.fa, C.fa. And I want to calculate the distance between A.fq and B.fa, but not consider k-mers in C.fa. In other words, I need to do dist(A-C, B-C). I am not sure whether dashing can be used to do this?

If it works, how about many "B", I mean, dist(A-C, "genome_path"-C), where "genome_path" refers to many fasta files (just like the input file (genome_path.txt) of dist function of dashing), not only one B.

Thanks a lot for your answering in advance!

dnbaker commented 3 years ago

Hi there!

This is currently not supported, but it's something you could use Dashing to perform parts of the computation. If you perform sketching with --full-khash-sets, then Dashing will create 64-bit hash sets and write them to disk in a gzip file. The first 8 bytes is a number indicating the number of hashes, and the rest of the file is 64-bit. You could then load the hash sets for A and B, filtering by C.

I see that it could be very useful, and we'll consider supporting it directly as development continues. Let me know if you need any further help or have any question..

Thanks,

Daniel

liaoherui commented 3 years ago

Hi, Daniel

Got it. Thanks a lot for your prompt reply!

I will try to use the "sketch" function. In this case, is it possible to load these hash files by python? Cause I am not very familiar with c++.

Another not so important question: (Besides, by using the "sketch" function, is it possible to save hash with genome labels like {kmer1->{A, C}, kmer2->{B}} in the future version? If so, it will be really helpful for the development of k-mer-based analysis tool! But I guess it could be time-consuming or is not the target of Dashing. If so, you can ignore this question.)

Regards, Herui Liao

liaoherui commented 3 years ago

Hi, Daniel

I have just tested the "--full-khash-sets" parameters with one genome. But there is an error.

The command I used is dashing_s128 sketch -p12 --use-full-khash-sets -k31 test.fasta

And the error info is: image

I have assigned 200G memory for this job, which should be enough for one small bacterial genome (~5M).

dnbaker commented 3 years ago

Hi Liao,

You're right, that probably isn't running out of memory. That was a bug, unfortunately, which I've found/fixed in this branch, and it's now been merged into main here.

You can download the new binaries from https://github.com/dnbaker/dashing/tree/main/release/.

You could do a lot of functionality from within Python. To parse each of these k-mer files, here's some Python code:

def parse_khs(path):
    import numpy as np
    import gzip
    data = np.frombuffer(gzip.open(path, "r").read(), dtype=np.uint64)
    ld = data[0]
    data = data[1:]
    assert ld == len(data)
    return data

This yields a hash set in vector form. So after using the fixed/rebuilt code to eliminate the segmentation fault, you might try something like this:

def filtered_jaccard(a, b, c):
    adata, bdata, cdata = map(parse_khs, (a, b, c))
    cset = set(c)
    afilt = set(adata) - cset
    bfilt = set(bdata) - cset
    isz = afilt & bfilt
    union = afilt | bfilt
    return len(isz) / len(union)
liaoherui commented 3 years ago

Hi, Daniel

Thanks a lot for your suggestions and the new version of dashing!

However, when I tried this new binary, it seems there are still some problems. As shown in the picture below, the new version can not be executed while the old version works well.

image

The same with s256 and s512.

dnbaker commented 3 years ago

Thanks for letting me know.

What's your operating system?

You could check if it has permissions (cmhod 755 dashing_s128), in case that fixes it. Otherwise, maybe cloning fresh and building manually might help?

git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing

You can then install it either with sudo make install or by manually copying the executable into a folder in the PATH environmental variable.

liaoherui commented 3 years ago

Thanks! Will try.

My operation system is Linux, and the version info is shown below. image

dnbaker commented 3 years ago

By the way - while there's no such functionality in this software, I added to Dashing2 a feature which supports something like this. The feature uses the flag --filterset <path>, which reads all k-mers from that path and then skips them when sketching from the remaining files.

And your request inspired me to provide it - so thank you!