dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
749 stars 294 forks source link

n_unique_kmers() returns 0 upon save/reload #1781

Open dkoslicki opened 6 years ago

dkoslicki commented 6 years ago

I noticed that using the Nodegraph method n_unique_kmers() returns 0 whenever a Nodegraph is saved and then reloaded. Minimum working example:

import khmer
import tempfile
import os
temp_file = tempfile.mktemp()
ng = khmer.Nodegraph(5, 1e5, 4)
ng.add("AAAAA")
print("Number k-mers: %d" % ng.n_unique_kmers())  # Returns correct answer of 1
ng.save(temp_file)
# Reload and try again
ng2 = khmer.load_nodegraph(temp_file)
print("Number k-mers after load: %d" % ng2.n_unique_kmers())  # Returns incorrect answer of 0
os.remove(temp_file)  # clean up

Same thing happens on realistic data.

Oddly enough, the method n_occupied() doesn't seem to have this issue:

print("Number occupied: %d" % ng.n_occupied())
print("Number occupied after load: %d" % ng2.n_occupied())

As an aside: it isn't clear to me (from the documentation) that one should use n_unique_kmers() or n_occupied() if one wants to estimate the number of (distinct) k-mers that are present in a Nodegraph().

betatim commented 6 years ago

Thanks for finding this. Which version of khmer did you use? I managed to reproduce this on master with the slight modification of:

ng3 = khmer.Nodegraph.load(temp_file)
betatim commented 6 years ago

n_unique_kmers isn't stored in the saved file, which is why it comes out as zero after loading a file. Does anyone remember why this was done or simply an oversight?

I think you want to use n_unique_kmers for estimating distinc k-mers. n_occupied keeps track of how many bits are turned on in the first table used in the Nodegraph, not all of them.

dkoslicki commented 6 years ago

It looks like I'm using 2.1.1 (which is what pip install khmer gives). Good to know about n_unique_kmers versus n_occupied; the source of my confusion was the fact that (when creating a Nodegraph, not loading one), n_unique_kmers and n_occupied were always returning the same value on real data (though I have subsequently found an artificial example where they differ).

ctb commented 6 years ago

This is an oversight, @betatim!

@dkoslicki, the HyperLogLog stuff (in unique-kmers.py and elsewhere) is probably the best way to go for estimating k-mer counts, if that's feasible. @luizirber may have some other thoughts as I know he is working on similar stuff for sourmash.

Also, #1248 needs to be linked in...

dkoslicki commented 6 years ago

Thanks @ctb; I am already forming the Nodegraph's, so thought n_unique_kmers would be a handy way to circumvent repeat calculations.