dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

test_load_graph_multithread() in test_scripts.py finds different unique k-mers #1248

Open johnsolk opened 9 years ago

johnsolk commented 9 years ago

Working on https://github.com/dib-lab/khmer/issues/998, using this command:

./setup.py nosetests --tests tests/test_scripts.py:test_load_graph_multithread

default args:

args = ['-N', '4', '-x', '1e7', '-T', '8', outfile, infile]

Running it ~20 times displayed 4 different unique k-mer counts in stdout:

Total number of unique k-mers: 1124051
Total number of unique k-mers: 1124050
Total number of unique k-mers: 1124049
Total number of unique k-mers: 1124052

When the args are changed to -T 1, only one unique k-mer count appears (ran ~20x):

Total number of unique k-mers: 1124049

OS info:

ProductName: Mac OS X
ProductVersion: 10.9.5
BuildVersion: 13F34

mr-c commented 9 years ago

Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch, or it could be a threading bug. @luizirber and @camillescott may have insights to share.

ctb commented 9 years ago

On Thu, Aug 13, 2015 at 02:08:58PM -0700, Michael R. Crusoe wrote:

> Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch, or it could be a threading bug. @luizirber and @camillescott may have insights to share.

Having dug around in that code recently (and now simplified it), it shouldn't be happening if the code were working the way I understood it to work. So there's almost certainly a bug... Thanks, @ljcohen!

Let's take a look through the issue tracker to find you a new bug to chase while we figure this out :)

mr-c commented 9 years ago

@ctb Is this "known-issues" worthy?

ctb commented 9 years ago

On Fri, Sep 04, 2015 at 09:49:16AM -0700, Michael R. Crusoe wrote:

> @ctb Is this "known-issues" worthy?

yep.

camillescott commented 9 years ago

This should be resolved by #876 (and serve as extra incentive for me to get that merged).

betatim commented 7 years ago

Another occurrence of a Heisenbug, this time in test_abundance_dist_threaded.

betatim commented 7 years ago

I'm looking into both of these.

For reference:

$ for n in {1..20}; do scripts/abundance-dist-single.py -s -x 1e7 -N 2 -k 17 -z --threads 2 tests/test-data/test-reads.fa /tmp/test.dist 2>&1 | grep Total; done
$ for n in {1..20}; do scripts/load-graph.py -N 4 -x 1e7 -T 8 /tmp/test tests/test-data/test-reads.fa 2>&1 | grep "Total"; done;

Both produce varying numbers of unique k-mers when executed with more than one thread.

betatim commented 7 years ago

The easy fix is to add a lock at a fairly high level, which wipes out all the benefits of having more threads.

The long-term solution is to put a queue between the reader and the hashtable. Then you can have multiple threads reading the input file (in a chunked fashion) and separately control the number of threads that consume reads from the queue. At the very least we need a Parser instance per thread that shares nothing with the others. Right now the filling of a read is protected by a home-made lock, but nothing seems to stop other threads from calling read->reset() at any moment, or from replacing the contents of a read while another thread is still using it to fill a hashtable.

(When reading a gzip'ed file you probably want several producers as that takes quite a bit of CPU. For a plain text file a single thread probably is enough and you'd want more threads consuming the reads.)
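A minimal, self-contained sketch of that single-producer / multi-consumer layout, in plain Python with a list of strings standing in for the parser and a set standing in for the hashtable (illustrative only, not khmer's actual ReadParser/hashtable API):

```python
import queue
import threading

K = 17

def reader(reads, q, n_consumers):
    # Single producer: push reads onto the queue, then one sentinel
    # per consumer so every consumer thread knows when to stop.
    for seq in reads:
        q.put(seq)
    for _ in range(n_consumers):
        q.put(None)

def consumer(q, kmers, lock):
    # Each consumer owns a read once it is dequeued, so no other
    # thread can reset or overwrite it while we are counting.
    local = set()
    while True:
        seq = q.get()
        if seq is None:
            break
        for i in range(len(seq) - K + 1):
            local.add(seq[i:i + K])
    with lock:  # merge once at the end, not per k-mer
        kmers.update(local)

reads = ["ACGT" * 10, "TTGCA" * 8]    # stand-in for a FASTA parser
q = queue.Queue(maxsize=1000)         # bounded, so the reader can't race ahead
kmers, lock = set(), threading.Lock()
consumers = [threading.Thread(target=consumer, args=(q, kmers, lock))
             for _ in range(4)]
for t in consumers:
    t.start()
reader(reads, q, len(consumers))
for t in consumers:
    t.join()
print("Total number of unique k-mers:", len(kmers))
```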

betatim commented 7 years ago

The gzip format makes parallel decompression ~impossible. Tools like pigz speed up compression but can't do much for decompression (~6s for either gunzip or pigz on the ecoli data).

So: start with a single-threaded reader feeding a queue, and consume the reads in parallel.

betatim commented 7 years ago

After some digging into this and failing to reproduce it, I noticed that the bug appears if you reduce the size of the BF (Bloom filter). From that I gathered that there isn't a bug anywhere; rather, whether or not a k-mer is "new" depends on which k-mers were entered into the BF before it. With only one thread the order is always the same, but with multiple threads the order changes between re-runs. Example:

A simple BF with 9 slots and 3 hash functions, and three k-mers with bit patterns:

A: 100010100
B: 000010101
C: 111000000

Entered in different orders:

ABC -> 3 unique k-mers
CAB -> 3 unique k-mers
BCA -> 2 unique k-mers

So even with no bug you can get different answers for how many unique k-mers there are. The reason you don't see it for (very) large BFs is probably that the BF is super sparse.
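A tiny pure-Python simulation of the toy example above, with each k-mer's bit pattern standing in for the outputs of the 3 hash functions:

```python
from itertools import permutations

# 9-slot Bloom filter; each pattern marks the 3 slots that k-mer sets.
PATTERNS = {
    "A": "100010100",
    "B": "000010101",
    "C": "111000000",
}

def count_unique(order):
    slots = [False] * 9
    unique = 0
    for kmer in order:
        bits = [i for i, b in enumerate(PATTERNS[kmer]) if b == "1"]
        if not all(slots[i] for i in bits):  # at least one unset bit -> "new"
            unique += 1
        for i in bits:
            slots[i] = True
    return unique

for order in permutations("ABC"):
    print("".join(order), "->", count_unique(order), "unique k-mers")
```

Running it prints 3 for most orders but 2 for BCA and CBA: same k-mers, different insertion order, different count.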

betatim commented 7 years ago

Short gist with some code to reproduce this https://gist.github.com/betatim/35112d0d6fd0ad543748ae3355cd0cbc

ctb commented 7 years ago

This makes sense to me. Not sure what to do about it, but it makes sense to me.

betatim commented 7 years ago

I don't think there is anything that can be done.

ctb commented 7 years ago

@betatim @luizirber take a look at https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter - this might be a better (approximate) way of computing n_unique_kmers.
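For reference, the estimator from that Wikipedia section (due to Swamidass & Baldi) is n* = -(m/k) * ln(1 - X/m), where m is the filter size in bits, k the number of hash functions, and X the number of bits set to one. A minimal sketch for a single flat bit array (khmer's -N separate tables would presumably need a per-table variant):

```python
import math

def estimate_n_unique(num_set_bits, filter_size_bits, num_hashes):
    # Approximate the number of distinct items inserted into a Bloom
    # filter: n* = -(m / k) * ln(1 - X / m)
    m, k, X = filter_size_bits, num_hashes, num_set_bits
    return -(m / k) * math.log(1.0 - X / m)

# Illustrative numbers only: a 1e7-bit filter with 4 hash functions
# in which 4.4e6 bits are set.
print(round(estimate_n_unique(4.4e6, 1e7, 4)))  # ~1.45e6
```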