Open dejsha opened 2 years ago
Hi dejsha,
I'm sorry to hear that you're having difficulties, and I'll try to find out where they're coming from.
I've taken the latest dashing_s128 release and run a few commands from within the Dashing repository. First, I tried without --use-full-khash-sets
, and it worked as expected:
[dbaker49@login02 dashing]$ find bonsai/test/ -name '*fna.gz' > fns.txt
[dbaker49@login02 dashing]$ ./dashing_s128 cmp -F fns.txt --use-full-khash-sets --wj-exact -p4 -k25
Dashing version: v1.0
#Path Size (est.)
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz 5594398
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz 5226083
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 5182481
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz 4646303
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz 4641623
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz 4558920
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 2683594
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 2478048
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 2449961
##Names bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz - 0.275324 3.80446e-06 0.408161 0.408451 0.400065 3.62407e-07 2.47756e-07 2.48621e-07
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz - - 3.0744e-06 0.304341 0.304541 0.310391 3.79282e-07 2.59601e-07 2.60551e-07
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz - - - 3.3575e-06 3.3591e-06 4.51682e-06 1.65269e-05 3.39414e-05 3.40663e-05
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz - - - - 0.995331 0.665783 4.09283e-07 2.80727e-07 2.81839e-07
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz - - - - - 0.666635 4.09544e-07 2.80912e-07 2.82025e-07
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz - - - - - - 4.14221e-07 2.84213e-07 2.85352e-07
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - - - 0.0091678 0.0088866
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - 0.582643
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - -
[dbaker49@login02 dashing]$ ./dashing_s128 cmp -F fns.txt --use-full-khash-sets --wj -p4 -k25
Dashing version: v1.0
#Path Size (est.)
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz 5378246
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz 5118503
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 4889417
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz 4560493
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz 4567585
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz 4481182
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 2676200
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 2452166
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 2424678
##Names bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz - 0.169356 2.72702e-06 0.236958 0.233108 0.232379 2.4831e-07 1.27707e-07 1.28157e-07
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz - - 2.09834e-06 0.20104 0.198809 0.199656 2.56585e-07 1.32089e-07 1.3257e-07
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz - - - 2.22225e-06 2.64355e-06 3.20151e-06 1.32179e-05 1.63455e-05 1.39459e-05
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz - - - - 0.546015 0.399542 2.76369e-07 1.42599e-07 1.4316e-07
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz - - - - - 0.391206 0 0 1.43015e-07
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz - - - - - - 1.39716e-07 0 1.44805e-07
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz - - - - - - - 0.00571696 0.00364102
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz - - - - - - - - 0.196344
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz - - - - - - - - -
You're right that the weighted cardinality should be higher than the unweighted cardinality. I think part of the issue is that you're using --wj
instead of -wj-exact
. The --wj
method can get out of sync when generating weighted sketches, and so it tends to reduce estimated similarities. --wj-exact
comes at a bit of runtime cost, but it shouldn't have that problem, and it's probably the better flag to use for most applications.
I'm not sure why the cardinalities are ending up zero.. Are these assemblies compressed with something other than gzip? I think Dashing might have trouble reading bz2 and xz compressed files.
Thanks for making the issue, and I'm happy to help more.
Daniel
Hi, I was trying to use Dashing (Dashing version: v0.5.7b, downloaded May 2021) to estimate similarities between metagenomic assemblies (ie. each sample was represented by a set of Metaspades scaffolds). The total size of the assemblies varied between 14 500 thousand bp and 650 thousand bp. I was testing with k=25 and k=12 My intention was to do: