dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

Using bloom filter to perform many to many comparison #37

Closed mihkelvaher closed 4 years ago

mihkelvaher commented 4 years ago

Hi!

Problem: compare a lot of bigger k-mer lists (n) with somewhat smaller k-mer lists (also n).

The goal is to find how many k-mers are in both lists (intersection size) and bloom filters seem to be ideal for this as the same bit array can be used multiple times and the intersection size is expected to be very small (usually 0). I can convert the k-mer lists into fastas so Dashing can read it but working with bf sketches is a bit obscure.

Should the workflow be something like this?:

  1. Create bloom filters for every big list
  2. Create a sketch for every small list
  3. Compare each bloom filter with every small list sketch (how?)
  4. (Optional) check if resulting probably-yes k-mers are actually present (how?)
  5. Output the intersection sizes (number of k-mers present in both lists) (--full-containment-dist??)

Note: I want to use all of the k-mers, so no subsetting. Is --use-full-khash-sets sufficient or do any other parameters needed?

This is part of a larger problem where I'm trying to find the count 1 mismatch edit distance between two sets of k-mer lists. Currently, just creating all possible mismatched k-mers (bigger lists) and intersecting them with smaller lists seems to be the fastest approach, which is still slow.

dnbaker commented 4 years ago

Well, it depends on exactly what you want to do.

To make a set of sketches for a set of genomes, using dashing sketch -p${threads} --use-bloom-filter -F${fnames} -S${log2_sketch_size}, at which point a bloom filter will be cached for each of them.

Let's say big_list.txt is a list of fastas and small_list1.txt and small_list2.txt lists of smaller fastas.

The default Dashing comparison is all vs all, but to compare set A against set B, use the -Q, --query-paths argument:

dashing dist -Q query_paths.txt -F reference_paths.txt --use-bloom-filter --sizes with emit the rectangular matrix of union sizes between paths in query_paths.txt and paths in reference_paths.txt, which you could subtract the estimated cardinality of the genome itself from to get the intersection, or --containment-index to get $\frac{|A \cap B|}{|A|}$ and multiply by that cardinality. We don't currently support intersection size, but I'd like to add that.

For this, you'd get an estimate of the number of shared kmers. To get an exact answer, --use-full-khash-sets works for all of these methods as well, but I don't quite see how the bloom filters would come into play this way.

I imagine there's probably an easier way to solve your problem, but let me think about it a little more. To check that I understand, what you want to get out of this list is the number of kmers within each original set (small) that are shared with those from the other set, within an edit distance of one. Is that correct?

mihkelvaher commented 4 years ago

That is correct, meaning that if we have a set of big lists (A.list, B.list, C.list...) and a set of smaller k-mer lists (a.list, b.list, c.list...), then the results should be something like

A a 0
A b 2
A c 0
B a 1
...

Where A consists of ATG, ATT, TTT a consists of CCC, GGG b consists of AGG, CTT, CAC

I'm thinking of writing a script that creates a bloom filter while generating all possible k-mers and writes only the bit array to disk which is later read into memory and compared against with while streaming through the other set of lists. Simplified commands might look something like this: cat A_kmers.list | createMismatchedKmers.pl | createBloomFilter.pl > A.bf

cat a_kmers.list | compareListWithBloomFilter.pl A.bf > A_vs_a_maybe.list doIntersectionBetween A_kmers.list A_vs_a_maybe.list | wc -l

cat b_kmers.list | compareListWithBloomFilter.pl A.bf > A_vs_b_maybe.list ...

dnbaker commented 4 years ago

I think that a radix tree might be a better fit for exploring strings within an edit distance of one of a set of k-mers, but I'm going to close this for now. Feel free to contact me for further discussion if you're still interested at dnb@cs.jhu.edu.