Kingsford-Group / bloomtree


question about selecting bf_size and number of hashes #13

Closed droe-nmdp closed 8 years ago

droe-nmdp commented 8 years ago

Can you give me suggestions on how to select bf_size and the number of hashes? For example, for counting a fastq with approximately 600 million reads, jellyfish histo outputs:

jellyfish histo mer_counts.jf -h 2
1 14440813945
2 598353480
3 2666010785

I'm not sure exactly what those numbers mean or how I should apply them to bf_size and the number of hashes.

droe-nmdp commented 8 years ago

More specifically, I get this output (sometimes also with 'what(): Hash full'):

terminate called recursively
Aborted (core dumped)

via this command:

bt count --cutoff 50 --threads 10 testHash 14000000000 test.fastq test.bf.bv

It happens whether I vary the bloom filter size by an order of magnitude in either direction or vary the number of hash functions between 1 and 5.

Bradsol commented 8 years ago

I'll try to answer each of your questions in turn.

First, the question about the Jellyfish output: the Jellyfish histogram bins all kmers by their number of occurrences. So there are 14440813945 kmers which occur exactly once (which you should probably ignore as sequencing errors unless you want really high sensitivity and are okay with a high false positive rate) and 598353480 kmers which occur exactly twice; the last bin holds everything that occurs >= 3 times. The remaining (non-singleton) kmers represent the number of unique elements in the file. This gives you a good minimum size for a single bloom filter but doesn't really help you determine the Bloomtree sizes.
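For illustration only (this is not part of the bloomtree tooling): a minimal Python sketch that sums the distinct-kmer count at or above a chosen cutoff from a saved jellyfish histo output. It assumes the usual two-column format and a hypothetical file name histo.txt (e.g. from `jellyfish histo mer_counts.jf > histo.txt`):

```python
# Illustrative sketch: sum the number of distinct kmers at or above a
# count cutoff from a two-column `jellyfish histo` output file.
# "histo.txt" is a hypothetical file name.
cutoff = 2  # ignore singletons, which are likely sequencing errors

distinct = 0
with open("histo.txt") as fh:
    for line in fh:
        count, n_kmers = (int(x) for x in line.split())
        if count >= cutoff:
            distinct += n_kmers

print(f"distinct kmers with count >= {cutoff}: {distinct}")
```

With the histogram above and a cutoff of 2, this yields 598353480 + 2666010785, roughly 3.26e9 distinct kmers, which is the kind of minimum single-filter estimate described above (remembering that the last bin aggregates all counts >= 3).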

While we are working on a more automatic / rigorous process for selecting BF sizes, here is an explanation of the numbers we selected in our own work, which might assist you in choosing yours:

We used a bf_size of 2,000,000,000 (2e9) for our experiments. To arrive at this number, we counted the combined kmer content of 100 random files from the 2,652 files used in our analysis and obtained an estimate of 1,902,731,933 unique kmers. The rationale for this size is two-fold:

1) While we want to ensure that the union of many filters remains unsaturated, the accuracy of the leaves [which contain only a single experiment] is the only concern for false positive results. To understand this, observe that any query hit in an internal node of the tree proceeds further down the tree until it reaches a leaf. Only once the query has passed the leaf is it said to have found a matching experiment. Thus any size larger than a single experiment's maximum kmer content is sufficient to achieve the same accuracy results.

2) Given that the accuracy is largely fixed provided there's enough room for a single experiment, the question of size becomes a question of what is efficient to store and process. Saturation in the internal nodes of the tree can dramatically increase search time: if a filter is full of 1s, it always returns a "kmer hit" and fails to prune the tree, while taking up valuable query time to process. At the same time, reducing saturation by increasing the filter size can negatively affect build and query times due to the increased read/write load associated with each file.

As further evidence for our selection, we noted that 2e9 was not a significant burden in terms of size (239 MB / file) and that the filters near the root of the tree built at this size are still useful [queries known not to exist in the tree terminate in the first few levels of the hierarchy].

I prefer to think of the bf_size parameter as a user input that requires a bit of experimenting. There is a clear minimum size implied by any single experiment's estimated kmer count, but the maximum size depends on multiple factors, including hardware concerns (read/write and storage costs), the estimated similarity in kmer content (two files with high overlap do not increase saturation), and a potential tradeoff between query speed and query accuracy based on both the data and hardware. A rough starting point for the minimum is sketched below.
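As a generic starting point (the textbook Bloom filter formulas, not a procedure from this project): given an estimated number of distinct kmers n and a target false positive rate p, the standard size is m = -n * ln(p) / (ln 2)^2 bits with k = (m / n) * ln 2 hash functions. A sketch:

```python
import math

def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Textbook Bloom filter sizing: bits m and hash count k for
    n distinct items at target false positive rate p."""
    m = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return math.ceil(m), max(1, round(k))

# e.g. the ~1.9e9 unique-kmer estimate quoted above, at a 5% FPR
bits, hashes = bloom_parameters(1_902_731_933, 0.05)
print(f"bf_size ~ {bits:,} bits ({bits / 8 / 2**20:.0f} MiB), {hashes} hash functions")
```

Treat the output as a floor rather than a prescription: as discussed above, saturation of the internal nodes and read/write costs push the practical choice around, and fewer hash functions than the formula suggests may be preferable for query speed.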

Bradsol commented 8 years ago

Lastly, I cannot debug the errors you are getting from the count command alone. Very likely there is something wrong with either your hash or your test.fastq. If you are still seeing this error, please email the files in question (test.fastq and the hash [along with the command you used to generate the hash]) and I'll see what I can do.

droe-nmdp commented 8 years ago

Thank you, that was helpful regardless of my immediate issue.

It looks like my problem has to do with the format of the fastq file as produced by samtools. When I create the fastq via samtools view [-uf64|-uf128] | samtools bam2fq -nO, I get the error.

If I instead use samtools to extract to fasta, 'bt count' runs to completion.
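For anyone hitting the same thing: a minimal, purely illustrative sketch for sanity-checking a fastq before feeding it to bt count, assuming plain uncompressed 4-line records (header, sequence, '+', quality of matching length):

```python
import sys

# Flag fastq records that violate the plain 4-line layout
# (header, sequence, '+', quality) that tools commonly expect.
# Illustrative only; run as: python check_fastq.py test.fastq
def check_fastq(path: str) -> None:
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    if len(lines) % 4 != 0:
        print(f"warning: {len(lines)} lines is not a multiple of 4")
    for i in range(0, len(lines) - 3, 4):
        head, seq, plus, qual = lines[i:i + 4]
        if not head.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            print(f"suspect record starting at line {i + 1}: {head[:40]}")

check_fastq(sys.argv[1])
```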