dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Distances > 0.05 but < 1 are unreliable? #42

Open wwood opened 4 years ago

wwood commented 4 years ago

Hi again,

I've been using dashing as a prefilter for genome dereplication, since it is much faster than FastANI. I'd previously been using mash for this. I've noticed that some genomes are given distances that are between 0.05 and 0.10, but seem to be spurious. For instance, here's mash distance vs. dashing distance calculated with -M:

[Image: plot of mash distance vs. dashing distance calculated with -M]

I tested 10 randomly chosen genomes from that top stripe where mash=1 and dashing<1, and none seemed to be closely related genomes, so it doesn't seem that dashing is simply producing better estimates. The issue does seem to be reasonably widespread, at least in this dataset: dashing predicts a distance < 1 for 49% of genome pairs, where mash predicts 4%.

Is this a known issue? Am I not using dashing correctly? Is there some way I can detect these cases?

Thanks, ben.

mihkelvaher commented 4 years ago

Hi!

I just recently encountered a similar issue. When I made a tree from the distance matrix and looked at how different bacterial strains were clustered together, some species were mixed. I got a decent result with these arguments: --sketch-size 20 -J --use-range-minhash --full-mash-dist. If I remember correctly, -J and --use-range-minhash had the greatest impact.

Mihkel

dnbaker commented 4 years ago

This is rather interesting, as for the paper our end result was measuring Jaccard Index accuracy, whereas the mash distance is a log transform downstream.
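To make the "log transform downstream" concrete, here is a minimal sketch of the Mash distance formula, D = -(1/k) * ln(2j / (1 + j)), applied to a Jaccard index j. The function name `mash_distance` and the k-mer length of 21 are assumptions chosen for illustration; this is not dashing's own code, and the k used in the runs above is not stated in this thread.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance from a Jaccard index: D = -(1/k) * ln(2j / (1 + j)).

    A Jaccard index of 0 has no finite log, so it is reported as distance 1.
    The result is clamped to the range [0, 1].
    """
    if jaccard <= 0.0:
        return 1.0
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))


if __name__ == "__main__":
    k = 21  # assumed k-mer length, for illustration only
    for j in (1.0, 0.5, 0.1, 0.01, 0.001, 0.0):
        print(f"J={j:<6} D={mash_distance(j, k):.4f}")
```

Because of the logarithm, a very small but nonzero Jaccard estimate maps to a distance well below 1, while an estimate of exactly 0 maps to 1. So small errors in the Jaccard estimate at the low end can move the reported distance by a lot.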

Some of the issue could be sketch size (though -S20 is getting to be comparable to the genome sizes): Mash defaults to 4Kb sketches, while dashing defaults to 1Kb. (The equivalent setting would be -S/--sketch-size 12.) From the paper, it seems like b-bit minhash is marginally more accurate at low Jaccard Index but less accurate at higher ones. For low JI (larger distances), I might instead try --use-bb-minhash/-8, which is still about as fast and accurate. (You can tune the number of bits with -B/--bbits.)
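The sketch-size arithmetic above can be checked with a short snippet. It assumes that the -S/--sketch-size value is the base-2 logarithm of the sketch size in bytes, which is what the 1Kb and 4Kb figures in this comment imply; `sketch_bytes` is a hypothetical helper, not part of dashing.

```python
def sketch_bytes(log2_size: int) -> int:
    """Sketch size in bytes, assuming -S is log2 of the size in bytes."""
    return 1 << log2_size


if __name__ == "__main__":
    # 10 gives the 1Kb figure, 12 the 4Kb figure, 20 is the -S20 case above
    for s in (10, 12, 20):
        print(f"-S {s}: {sketch_bytes(s)} bytes")
```

Under that assumption, -S 20 gives a sketch of about 1 MiB, which is why it starts to be comparable to the size of a small genome.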

As a comment, -J is a calculation method for HLLs, so it isn't used when --use-range-minhash is active; which of the two to use depends on your use case. For HLLs, running with -J is more accurate than running without it, though at a runtime penalty.

I'll agree that --full-mash-dist is likely worth doing, as it removes a layer of approximation in the calculation. It only makes a difference for small Jaccards, but that seems to be important.