Closed mihkelvaher closed 4 years ago
Well, it depends on exactly what you want to do.
To make a set of sketches for a set of genomes, using dashing sketch -p${threads} --use-bloom-filter -F${fnames} -S${log2_sketch_size}
, at which point a bloom filter will be cached for each of them.
Let's say big_list.txt
is a list of fastas and small_list1.txt
and small_list2.txt
lists of smaller fastas.
The default Dashing comparison is all vs all, but to compare set A against set B, use the -Q, --query-paths
argument:
dashing dist -Q query_paths.txt -F reference_paths.txt --use-bloom-filter --sizes
with emit the rectangular matrix of union sizes between paths in query_paths.txt and paths in reference_paths.txt, which you could subtract the estimated cardinality of the genome itself from to get the intersection, or --containment-index
to get $\frac{|A \cap B|}{|A|}$
and multiply by that cardinality. We don't currently support intersection size, but I'd like to add that.
For this, you'd get an estimate of the number of shared kmers. To get an exact answer, --use-full-khash-sets works for all of these methods as well, but I don't quite see how the bloom filters would come into play this way.
I imagine there's probably an easier way to solve your problem, but let me think about it a little more. To check that I understand, what you want to get out of this list is the number of kmers within each original set (small) that are shared with those from the other set, within an edit distance of one. Is that correct?
That is correct, meaning that if we have a set of big lists (A.list, B.list, C.list...) and a set of smaller k-mer lists (a.list, b.list, c.list...), then the results should be something like
A a 0
A b 2
A c 0
B a 1
...
Where
A consists of ATG, ATT, TTT
a consists of CCC, GGG
b consists of AGG, CTT, CAC
I'm thinking of writing a script that creates a bloom filter while generating all possible k-mers and writes only the bit array to disk which is later read into memory and compared against with while streaming through the other set of lists.
Simplified commands might look something like this:
cat A_kmers.list | createMismatchedKmers.pl | createBloomFilter.pl > A.bf
cat a_kmers.list | compareListWithBloomFilter.pl A.bf > A_vs_a_maybe.list
doIntersectionBetween A_kmers.list A_vs_a_maybe.list | wc -l
cat b_kmers.list | compareListWithBloomFilter.pl A.bf > A_vs_b_maybe.list
...
I think that a radix tree might be a better fit for exploring strings within an edit distance of one of a set of k-mers, but I'm going to close this for now. Feel free to contact me for further discussion if you're still interested at dnb@cs.jhu.edu.
Hi!
Problem: compare a lot of bigger k-mer lists (n) with somewhat smaller k-mer lists (also n).
The goal is to find how many k-mers are in both lists (intersection size) and bloom filters seem to be ideal for this as the same bit array can be used multiple times and the intersection size is expected to be very small (usually 0). I can convert the k-mer lists into fastas so Dashing can read it but working with bf sketches is a bit obscure.
Should the workflow be something like this?:
--full-containment-dist
??)Note: I want to use all of the k-mers, so no subsetting. Is
--use-full-khash-sets
sufficient or do any other parameters needed?This is part of a larger problem where I'm trying to find the count 1 mismatch edit distance between two sets of k-mer lists. Currently, just creating all possible mismatched k-mers (bigger lists) and intersecting them with smaller lists seems to be the fastest approach, which is still slow.