dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Less accurate estimator compared to MASH/sourmash/ANI #97

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Daniel,

I am attaching a real-world genome from the global Tara Oceans metagenomic study. I searched it against all GTDB genomes (https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/genomic_files_reps/gtdb_genomes_reps_r207.tar.gz) to find the top 20 best matches in terms of ANI, computed with OrthoANI (https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000760). Both Mash and sourmash perform well: they normally recover 16 to 17 of the best hits found by ANI. However, Dashing (with both the default MLE estimator and JMLE) does very poorly at ANI below 80%: only 9 of the 20 are found (the top 10 are fine). This means that for less similar genomes (smaller Jaccard), Dashing is much worse than Mash or sourmash, both of which use MinHash rather than HyperLogLog. I was under the impression that the Jaccard index estimated by HLL should be as good as the MinHash estimate.

These are the commands used:

dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F name.txt & dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F query_name.txt

Then I collect all the HLL files from the genome folder and create a list of those files.

dashing dist -F ./query_name_dashing_hll_JMLE.txt -Q name_dashing_hll_JMLE.txt --full-tsv --nthreads 128 --presketched -O ./OceanDNA-b42278.dashing.hll.JMLE.gtdb.txt

I am using the same k and sketch size (2^14) in Mash and sourmash. The top 10 are fine; nearly all are found. I also compared with our most recent SetSketch 1 implementation (equivalent to HLL), and its results are consistent with sourmash and Mash. The attached PDF shows the 10th to 20th best hits to the query (OceanDNA-b42278.fa) found by the tools mentioned above, for you to double-check. (Ignore "top 10" in the table title; it actually refers to the hits ranked 10th to 20th.) Should I use an even larger sketch size to better approximate ANI? I think not, because the top 10 are already very good, which suggests the sketch size is sufficient. Dashing is certainly faster than Mash, so I am wondering what the downside of that speed could be, e.g., lower accuracy for very small Jaccard indices (dissimilar genomes).
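The comparison described above (how many of the top ANI hits a sketching tool also ranks among its own top hits) can be expressed as a small script. This is only a sketch under stated assumptions: the function name `top_k_recall` and the toy scores are hypothetical, and it assumes each tool's output has already been parsed into a mapping from genome name to a similarity score where higher means more similar.

```python
def top_k_recall(reference_scores, estimated_scores, k=20):
    """Fraction of the reference top-k hits that also appear in the
    estimated top-k. Both arguments map genome name -> similarity
    (higher = more similar), e.g. ANI for the reference and a
    Jaccard estimate for the sketching tool."""
    def top_k(scores):
        # Rank genome names by descending similarity and keep the first k.
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])

    ref = top_k(reference_scores)
    est = top_k(estimated_scores)
    return len(ref & est) / len(ref) if ref else 0.0


# Toy data (hypothetical values, not taken from the attached results).
ani = {"g1": 95.0, "g2": 90.0, "g3": 80.0, "g4": 76.0}
jaccard = {"g1": 0.50, "g2": 0.30, "g3": 0.01, "g4": 0.02}

print(top_k_recall(ani, jaccard, k=2))  # both tools agree on g1 and g2
print(top_k_recall(ani, jaccard, k=3))  # g3 and g4 are swapped at low Jaccard
```

If a tool reports a distance instead of a similarity, negate it (or otherwise invert the ordering) before passing it in, so that both rankings sort in the same direction.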

Thanks,

Jianshu

OceanDNA-b42278.fa.zip

Blastn-ANI-dashing-setsketch.pdf

jianshu93 commented 1 year ago

Hello Daniel,

I found that bindash is much more accurate than Dashing for small Jaccard values, such as those around 0.01. Please see the attached results, which add bindash, using the same query and database genomes mentioned above. As you can see, bindash is the best while Dashing is the worst, so there must be some place where Dashing sacrifices accuracy for speed. Jaccard values around 0.01 are very important because they correspond to ANI of 75% to 78%, where most tools lose accuracy. I do not understand why the theoretical variance J*(1-J)/m (bindash) would be larger in practice than 1.074/m (MLE methods), as claimed in the paper (using m=10000 or so, and assuming inclusion-exclusion is perfect, which is not always true). Assuming the same sketch size, bindash is at least 1000 times more accurate than MLE with the inclusion-exclusion rule. The same holds for SetSketch: it also has much larger variance than bindash because of the nature of the approximation used in SetSketch 1 (b=0.001, m=4096). Can you please explain in more detail why Jaccard would be more accurate in Dashing than in bindash? @BenLangmead @dnbaker
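For reference, the link between Jaccard and ANI mentioned above can be checked with the Mash distance formula, D = -(1/k) * ln(2J / (1 + J)), treating 1 - D as an approximate ANI. The snippet below is a minimal sketch, assuming k = 16 as in the commands above; the function names are hypothetical. It also evaluates the two variance expressions exactly as quoted in this comment, without taking a position on whether they measure the same quantity.

```python
import math


def mash_distance(jaccard, k):
    """Mash distance from a Jaccard index: D = -(1/k) * ln(2J / (1 + J)).
    Returns 1.0 when the Jaccard index is zero (no shared k-mers)."""
    if jaccard <= 0:
        return 1.0
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


def minhash_variance(jaccard, m):
    """The J * (1 - J) / m expression quoted above for bindash."""
    return jaccard * (1.0 - jaccard) / m


def quoted_mle_variance(m):
    """The 1.074 / m expression quoted above for the MLE methods."""
    return 1.074 / m


k = 16
for j in (0.01, 0.02):
    d = mash_distance(j, k)
    print(f"J={j}: D={d:.4f}, approx. ANI={100 * (1 - d):.1f}%")
# J = 0.01 with k = 16 gives D of about 0.245, i.e. roughly 75.5% ANI.

m = 10000
print(minhash_variance(0.01, m), quoted_mle_variance(m))
```

Under this approximation, Jaccard values between 0.01 and 0.02 at k = 16 land between roughly 75% and 80% ANI, which is the range discussed above.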

Thanks

Jianshu

bindash_dashing.pdf