Open dnbaker opened 4 years ago
Fixed as of https://github.com/dnbaker/dashing/commit/22e340a2f428582b9e49f2570834f34ec205d193. (Emits intersection size instead of union size.)
Maybe it's still WIP, but with
dashing dist myfastas/* --size
v0.4.2
outputs paths/sizes ##Names and a matrix, but with v0.4.3-16-g4a39
instead of the matrix there's this:
terminate called after throwing an instance of 'sketch::exception::NotImplementedError'
what(): union_size not available for type double bns::us::union_size(const T&, const T&) [with T = sketch::hll::hllbase_t<>]
Aborted
Using just dashing dist myfastas/*
without --size
works fine.
EDIT: working with the v0.4.4 release EDIT2: v0.4.4 emits union sizes, not intersection
Hi Mihkel,
Thanks for reporting. I made a mistake making this change.
It's been patched, both in master at this commit and v0.4.5.
Would you give it another try for me?
Sample output (the files with cp are copies, so the expected intersection is complete):
Dashing version: v0.4.3-16-g4a39
#Path Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
##Names bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz - 4.82926e+06 4.82926e+06 4.82926e+06 0 0 0 0 0 0 0 0 0 0 00
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz - - 4.82926e+06 4.82926e+06 0 0 0 0 0 0 0 0 0 0 00
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz - - - 4.82926e+06 0 0 0 0 0 0 0 0 0 0 0 0
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz - - - - 0 0 0 0 0 0 0 0 0 0 0 0
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - 2.71886e+06 2.71886e+06 2.71886e+06 0 0 0 0 0 0 00
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - - 2.71886e+06 2.71886e+06 0 0 0 0 0 0 00
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - - - 2.71886e+06 0 0 0 0 0 0 0 0
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - - - - 0 0 0 0 0 0 0 0
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - - 2.43384e+06 2.43384e+06 2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - - - 2.43384e+06 2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - - - - 2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - - - - - 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - - - - - - 2.36853e+06 2.36853e+06 2.36853e+06
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - - - - - - - 2.36853e+06 2.36853e+06
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - - - - - - - - 2.36853e+06
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - - - - - - - - -
TL;DR: -T
vs default gives different results.
Using another k-mer program with the same k I found some differences that should not be there.
dashing dist fastas/* --sizes -T --use-full-khash-sets --kmer-length 32
Dashing vs real
60 vs 54
747 vs 711
...
Omitting -T
gives the exact results (which should be the case with --use-full-khash-sets
?).
Also, a minor detail (but probably a bit tricky to fix): -T
reports intersections with self to be 0.
Thanks for the find, I'll investigate this soon. You're right that it'd be a big tricky to have -T report intersections along the diagonal, but checking out the off-diagonal entries is important.
In practice, the zero info on the diagonal is more or less irrelevant so it can be pushed to the bottom of the TODO list.
Comparing the results of different data structures vs full khash sets I found that --sizes
doesn't work with bloom nor super minhash. Maybe it's intentional.
Some testing results, might be interesting: Currently, bb-minhash seems to outperform others (intersection sizes in correlation with full):
HLL_E 0,14208915
HLL_I 0,14208915
HLL_J 0,153491658
countingRangeMinhash 0,833072532
rangeMinhash 0,848715951
bb-minhash 0,93973125
The values are found on a small set of full genomes where only smaller intersection sizes are observed with --sketch-size 17
. Taking into account more (larger) intersection sizes the correlations become more similar with bb-minhash still at the top.
I'm closing this for now, but feel free to open if you have any further issues.
I managed to hide the comment previously so well I had trouble myself finding it 😄 :
Any plans on implementing --sizes
with bloom filter (super minhash also didn't work the last time I tried)?
I'm asking this just because using bloom filters the estimated intersection size should be more likely overestimated than underestimated (?) which is preferred for my project (afraid more of FNs than FPs).
I see. I can add this sometime relatively soon. I would guess that the bloom filter would overestimate.
I'm not sure if there's any way to arrive at cardinality estimates using SuperMinHash (and thereby intersection sizes), but I'll see what I can do. It uses random number generators and sampling to get values with given probabilities.
I'll probably have time in a couple of weeks, and thanks for the reminder.
Currently, dashing emits union sizes with
--sizes
, but not intersection size. You can get to that by subtracting the estimated set cardinality, but it would be preferable to emit it directly.