dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
160 stars 11 forks source link

Emit Intersection Size #38

Open dnbaker opened 4 years ago

dnbaker commented 4 years ago

Currently, dashing emits union sizes with --sizes, but not intersection size. You can get to that by subtracting the estimated set cardinality, but it would be preferable to emit it directly.

dnbaker commented 4 years ago

Fixed as of https://github.com/dnbaker/dashing/commit/22e340a2f428582b9e49f2570834f34ec205d193. (Emits intersection size instead of union size.)

mihkelvaher commented 4 years ago

Maybe it's still WIP, but with dashing dist myfastas/* --size v0.4.2 outputs paths/sizes ##Names and a matrix, but with v0.4.3-16-g4a39 instead of the matrix there's this:

terminate called after throwing an instance of 'sketch::exception::NotImplementedError'
  what():  union_size not available for type double bns::us::union_size(const T&, const T&) [with T = sketch::hll::hllbase_t<>]
Aborted

Using just dashing dist myfastas/* without --size works fine.

EDIT: working with the v0.4.4 release EDIT2: v0.4.4 emits union sizes, not intersection

dnbaker commented 4 years ago

Hi Mihkel,

Thanks for reporting. I made a mistake making this change.

It's been patched, both in master at this commit and v0.4.5.

Would you give it another try for me?

Sample output (the files with cp are copies, so the expected intersection is complete):

Dashing version: v0.4.3-16-g4a39
#Path   Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  4829255
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   4829255
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz  4829255
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2718859
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    2718859
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz   2718859
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2433839
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz   2433839
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz  2433839
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz   2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2368528
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz    2368528
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz   2368528
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz    2368528
##Names bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz  bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz   bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz   bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz    bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz   bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  -   4.82926e+06 4.82926e+06 4.82926e+06 0   0   0   0   0   0   0   0   0   0   00
bonsai/test/cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   -   -   4.82926e+06 4.82926e+06 0   0   0   0   0   0   0   0   0   0   00
bonsai/test/cp2.GCF_001723155.1_ASM172315v1_genomic.fna.gz  -   -   -   4.82926e+06 0   0   0   0   0   0   0   0   0   0   0   0
bonsai/test/cp2.cp.GCF_001723155.1_ASM172315v1_genomic.fna.gz   -   -   -   -   0   0   0   0   0   0   0   0   0   0   0   0
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   -   -   -   -   -   2.71886e+06 2.71886e+06 2.71886e+06 0   0   0   0   0   0   00
bonsai/test/cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    -   -   -   -   -   -   2.71886e+06 2.71886e+06 0   0   0   0   0   0   00
bonsai/test/cp2.GCF_000302455.1_ASM30245v1_genomic.fna.gz   -   -   -   -   -   -   -   2.71886e+06 0   0   0   0   0   0   0   0
bonsai/test/cp2.cp.GCF_000302455.1_ASM30245v1_genomic.fna.gz    -   -   -   -   -   -   -   -   0   0   0   0   0   0   0   0
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  -   -   -   -   -   -   -   -   -   2.43384e+06 2.43384e+06 2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp.GCF_000953115.1_DSM1535_genomic.fna.gz   -   -   -   -   -   -   -   -   -   -   2.43384e+06 2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp2.GCF_000953115.1_DSM1535_genomic.fna.gz  -   -   -   -   -   -   -   -   -   -   -   2.43384e+06 1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/cp2.cp.GCF_000953115.1_DSM1535_genomic.fna.gz   -   -   -   -   -   -   -   -   -   -   -   -   1.70487e+06 1.70487e+06 1.70487e+06 1.70487e+06
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   -   -   -   -   -   -   -   -   -   -   -   -   -   2.36853e+06 2.36853e+06 2.36853e+06
bonsai/test/cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz    -   -   -   -   -   -   -   -   -   -   -   -   -   -   2.36853e+06 2.36853e+06
bonsai/test/cp2.GCF_000762265.1_ASM76226v1_genomic.fna.gz   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   2.36853e+06
bonsai/test/cp2.cp.GCF_000762265.1_ASM76226v1_genomic.fna.gz    -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
mihkelvaher commented 4 years ago

TL;DR: -T vs default gives different results.

Using another k-mer program with the same k I found some differences that should not be there. dashing dist fastas/* --sizes -T --use-full-khash-sets --kmer-length 32 Dashing vs real 60 vs 54 747 vs 711 ... Omitting -T gives the exact results (which should be the case with --use-full-khash-sets?).

Also, a minor detail (but probably a bit tricky to fix): -T reports intersections with self to be 0.

dnbaker commented 4 years ago

Thanks for the find, I'll investigate this soon. You're right that it'd be a big tricky to have -T report intersections along the diagonal, but checking out the off-diagonal entries is important.

mihkelvaher commented 4 years ago

In practice, the zero info on the diagonal is more or less irrelevant so it can be pushed to the bottom of the TODO list.

Comparing the results of different data structures vs full khash sets I found that --sizes doesn't work with bloom nor super minhash. Maybe it's intentional.

Some testing results, might be interesting: Currently, bb-minhash seems to outperform others (intersection sizes in correlation with full):

HLL_E   0,14208915
HLL_I   0,14208915
HLL_J   0,153491658
countingRangeMinhash    0,833072532
rangeMinhash    0,848715951
bb-minhash  0,93973125

The values are found on a small set of full genomes where only smaller intersection sizes are observed with --sketch-size 17. Taking into account more (larger) intersection sizes the correlations become more similar with bb-minhash still at the top.

dnbaker commented 4 years ago

I'm closing this for now, but feel free to open if you have any further issues.

mihkelvaher commented 4 years ago

I managed to hide the comment previously so well I had trouble myself finding it 😄 :

Any plans on implementing --sizes with bloom filter (super minhash also didn't work the last time I tried)? I'm asking this just because using bloom filters the estimated intersection size should be more likely overestimated than underestimated (?) which is preferred for my project (afraid more of FNs than FPs).

dnbaker commented 4 years ago

I see. I can add this sometime relatively soon. I would guess that the bloom filter would overestimate.

I'm not sure if there's any way to arrive at cardinality estimates using SuperMinHash (and thereby intersection sizes), but I'll see what I can do. It uses random number generators and sampling to get values with given probabilities.

I'll probably have time in a couple of weeks, and thanks for the reminder.