Open dkoslicki opened 7 years ago
Thanks for finding this. Which version of khmer
did you use? I managed to reproduce this on master
with the slight modification of:
ng3 = khmer.Nodegraph.load(temp_file)
n_unique_kmers
isn't stored in the saved file, which is why it comes out as zero after loading a file. Does anyone remember why this was done or simply an oversight?
I think you want to use n_unique_kmers
for estimating distinc k-mers. n_occupied
keeps track of how many bits are turned on in the first table used in the Nodegraph
, not all of them.
It looks like I'm using 2.1.1 (which is what pip install khmer
gives).
Good to know about n_unique_kmers
versus n_occupied
; the source of my confusion was the fact that (when creating a Nodegraph
, not loading one), n_unique_kmers
and n_occupied
were always returning the same value on real data (though I have subsequently found an artificial example where they differ).
This is an oversight, @betatim!
@dkoslicki, the HyperLogLog stuff (in unique-kmers.py and elsewhere) is probably the best way to go for estimating k-mer counts, if that's feasible. @luizirber may have some other thoughts as I know he is working on similar stuff for sourmash.
Also, #1248 needs to be linked in...
Thanks @ctb; I am already forming the Nodegraph
's, so thought n_unique_kmers
would be a handy way to circumvent repeat calculations.
I noticed that using the Nodegraph method
n_unique_kmers()
returns 0 whenever a Nodegraph is saved and then reloaded. Minimum working example:Same thing happens on realistic data.
Oddly enough, the method
n_occupied()
doesn't seem to have this issue:As an aside: it isn't clear to me (from the documentation) that one should use
n_unique_kmers()
orn_occupied()
if one wants to estimate the number of (distinct) k-mers that are present in a Nodegraph().