Open liaoherui opened 3 years ago
Hi there!
This is currently not supported, but it's something you could use Dashing to perform parts of the computation. If you perform sketching with --full-khash-sets, then Dashing will create 64-bit hash sets and write them to disk in a gzip file. The first 8 bytes is a number indicating the number of hashes, and the rest of the file is 64-bit. You could then load the hash sets for A and B, filtering by C.
I see that it could be very useful, and we'll consider supporting it directly as development continues. Let me know if you need any further help or have any question..
Thanks,
Daniel
Hi, Daniel
Got it. Thanks a lot for your prompt reply!
I will try to use the "sketch" function. In this case, is it possible to load these hash files by python? Cause I am not very familiar with c++.
Another not so important question: (Besides, by using the "sketch" function, is it possible to save hash with genome labels like {kmer1->{A, C}, kmer2->{B}} in the future version? If so, it will be really helpful for the development of k-mer-based analysis tool! But I guess it could be time-consuming or is not the target of Dashing. If so, you can ignore this question.)
Regards, Herui Liao
Hi, Daniel
I have just tested the "--full-khash-sets" parameters with one genome. But there is an error.
The command I used is dashing_s128 sketch -p12 --use-full-khash-sets -k31 test.fasta
And the error info is:
I have assigned 200G memory for this job, which should be enough for one small bacterial genome (~5M).
Hi Liao,
You're right, that probably isn't running out of memory. That was a bug, unfortunately, which I've found/fixed in this branch, and it's now been merged into main here.
You can download the new binaries from https://github.com/dnbaker/dashing/tree/main/release/.
You could do a lot of functionality from within Python. To parse each of these k-mer files, here's some Python code:
def parse_khs(path):
import numpy as np
import gzip
data = np.frombuffer(gzip.open(path, "r").read(), dtype=np.uint64)
ld = data[0]
data = data[1:]
assert ld == len(data)
return data
This yields a hash set in vector form. So after using the fixed/rebuilt code to eliminate the segmentation fault, you might try something like this:
def filtered_jaccard(a, b, c):
adata, bdata, cdata = map(parse_khs, (a, b, c))
cset = set(c)
afilt = set(adata) - cset
bfilt = set(bdata) - cset
isz = afilt & bfilt
union = afilt | bfilt
return len(isz) / len(union)
Hi, Daniel
Thanks a lot for your suggestions and the new version of dashing!
However, when I tried this new binary, it seems there are still some problems. As shown in the picture below, the new version can not be executed while the old version works well.
The same with s256 and s512.
Thanks for letting me know.
What's your operating system?
You could check if it has permissions (cmhod 755 dashing_s128
), in case that fixes it. Otherwise, maybe cloning fresh and building manually might help?
git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing
You can then install it either with sudo make install
or by manually copying the executable into a folder in the PATH environmental variable.
Thanks! Will try.
My operation system is Linux, and the version info is shown below.
By the way - while there's no such functionality in this software, I added to Dashing2 a feature which supports something like this. The feature uses the flag --filterset <path>
, which reads all k-mers from that path and then skips them when sketching from the remaining files.
And your request inspired me to provide it - so thank you!
Firstly, thanks a lot for your excellent tool! It's really cool!
This is my question:
Currently, I have three files, A.fq, B.fa, C.fa. And I want to calculate the distance between A.fq and B.fa, but not consider k-mers in C.fa. In other words, I need to do dist(A-C, B-C). I am not sure whether dashing can be used to do this?
If it works, how about many "B", I mean, dist(A-C, "genome_path"-C), where "genome_path" refers to many fasta files (just like the input file (genome_path.txt) of dist function of dashing), not only one B.
Thanks a lot for your answering in advance!