dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0
161 stars 11 forks source link

weighted Jaccard together with full hash sets - zero genome sizes #89

Open dejsha opened 2 years ago

dejsha commented 2 years ago

Hi, I was trying to use Dashing (Dashing version: v0.5.7b, downloaded May 2021) to estimate similarities between metagenomic assemblies (ie. each sample was represented by a set of Metaspades scaffolds). The total size of the assemblies varied between 14 500 thousand bp and 650 thousand bp. I was testing with k=25 and k=12 My intention was to do:

  1. "normal" Jaccard (counts the presence/absence of distinct kmers)
  2. weighted Jaccard (weights the abundance of the distinct kmers, i.e. occurence of every distinct kmer is counted?) Both of them I wanted to do "exactly" - to calculate with exact numbers of the kmers in the samples. For 1. "normal Jaccard" I tried: ./dashing_s128 cmp -k 25 -F paths.txt -O j_similarities.txt -0 j_sizes.txt --use-full-khash-sets For 2. "weighted Jaccard" I tried: ./dashing_s128 cmp -k 25 -F paths.txt -O wj_similarities.txt -0 wj_sizes.txt --use-full-khash-sets --wj I also did the same two commands but with --sizes to see absloute numbers of shared k-mers. But, I guess that I did it wrong, at least for the weighted Jaccard as I recieved zero genome sizes in wj_sizes.txt for all samples and with both tested kmer lengths (but nothing in standard error). But, despite zeros for genome sizes, I still received non-zero outputs for the wj_similarities.txt and for wj_shared_kmers.txt (the latter produced with --sizes option). Strangely, if I compared the absolute numbers of shared kmers (with --sizes) between the same samples for weightedJ and normalJ, the "weighted jaccard"/"normal jaccard" proportions were cca 0.55 for k=25 and cca 0.85 for k=12. I was quite confused as I would assume higher (or, at extreme the same - if all the kmers which occured inside the samples were distinct) absolute numbers of shared kmers for weighted Jaccard than for normal Jaccard. So, I thought that I was incorrectly using the --use-full-khash-sets --wj combination - that it does not work together in the way I thought it was working. I tried with omitting --use-full-khash-sets and the HLL sketches gave me similar results for the number of shared kmers, again the proportions of shared kmers for "weighted jaccard"/"normal jaccard" were cca 0.6 for k=25 and cca 0.9 for k=12 (though the variance in the proportions between the sample comparisons was higher compared to when using --use-full-khash-sets). When omitting --use-full-khash-sets I recieved non-zero genome sizes for both "normal jaccard" and "weighted jaccard", and therefore I could compare the genome sizes for weighted J vs normal J and for different kmer lengths. And I was confused again - my understanding was that I should always receive higher (or at extreme the same) genome sizes for weighted (all kmers) compared to normal (only distinct kmers) jaccards - when comparing the same samples at the same kmer length. This was true for k=12 - I got higher genome size for weighted compared to normal jaccard, but for k=25 it was not always the case - for some samples the genome sizes were a bit higher for normal jaccard. But, for k=25, almost all kmers were probably distinct in each sample as the genome sizes were similar to the assembly sizes for both normal and weighted J - i.e. max 14 100 thousand and min 640 thousand for normal J and max 14 500 and min 624 thousand for weighted J, in which case, I would say that the similar genome sizes are expectable? (for k=12 it was 5 800 and 580 thousand for normal and 6 600 and 600 thousand for weighted J). And I thought that the slightly higher J compared to WJ genome sizes for some samples at k=25 could be some artefact of small (default) sketch size..and that maybe the higher number of shared kmers for normalJ compared to weightedJ could have something to do with small sketch size too. But, I still cannot explain the higher numbers of shared kmers for normalJ compared to weightedJ when I am using --use-full-khash-sets (in this case, the sketch size is irrelevant, isn't it?). And also, I am really confused with zero genomes sizes with --use-full-khash-sets --wj settings. Could you please help me with that - what am I doing wrong? I am really sorry for these lengthy descriptions and stupid questions - I am a biologist with minimal understanding of the theory behing dashing. Despite that, I would be really happy to include these results into my work, as subsequent analyses using --use-full-khash-sets "normal" Jaccard similarities gave very good biological sense. My further intention was to use weighted Jaccard with sequencing reads mapped to the metagenomic assemblies - and for that I would like to have the weighted Jaccard set correctly. Thank you very much for any insight, with best regards, Dasa.
dnbaker commented 2 years ago

Hi dejsha,

I'm sorry to hear that you're having difficulties, and I'll try to find out where they're coming from.

I've taken the latest dashing_s128 release and run a few commands from within the Dashing repository. First, I tried without --use-full-khash-sets, and it worked as expected:

[dbaker49@login02 dashing]$ find bonsai/test/ -name '*fna.gz' > fns.txt
[dbaker49@login02 dashing]$ ./dashing_s128 cmp -F fns.txt --use-full-khash-sets --wj-exact -p4  -k25
Dashing version: v1.0
#Path   Size (est.)
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  5594398
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  5226083
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  5182481
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz 4646303
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  4641623
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  4558920
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2683594
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2478048
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2449961
##Names bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  -   0.275324    3.80446e-06 0.408161    0.408451    0.400065    3.62407e-07 2.47756e-07 2.48621e-07
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  -   -   3.0744e-06  0.304341    0.304541    0.310391    3.79282e-07 2.59601e-07 2.60551e-07
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  -   -   -   3.3575e-06  3.3591e-06  4.51682e-06 1.65269e-05 3.39414e-05 3.40663e-05
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz -   -   -   -   0.995331    0.665783    4.09283e-07 2.80727e-07 2.81839e-07
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  -   -   -   -   -   0.666635    4.09544e-07 2.80912e-07 2.82025e-07
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  -   -   -   -   -   -   4.14221e-07 2.84213e-07 2.85352e-07
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   -   -   -   -   -   -   -   0.0091678   0.0088866
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  -   -   -   -   -   -   -   -   0.582643
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   -   -   -   -   -   -   -   -   -
[dbaker49@login02 dashing]$ ./dashing_s128 cmp -F fns.txt --use-full-khash-sets --wj -p4  -k25
Dashing version: v1.0
#Path   Size (est.)
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  5378246
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  5118503
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  4889417
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz 4560493
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  4567585
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  4481182
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2676200
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2452166
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2424678
##Names bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/ec/GCF_000008865.1_ASM886v1_genomic.fna.gz  -   0.169356    2.72702e-06 0.236958    0.233108    0.232379    2.4831e-07  1.27707e-07 1.28157e-07
bonsai/test/ec/GCF_000007445.1_ASM744v1_genomic.fna.gz  -   -   2.09834e-06 0.20104 0.198809    0.199656    2.56585e-07 1.32089e-07 1.3257e-07
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  -   -   -   2.22225e-06 2.64355e-06 3.20151e-06 1.32179e-05 1.63455e-05 1.39459e-05
bonsai/test/ec/GCF_000010245.2_ASM1024v1_genomic.fna.gz -   -   -   -   0.546015    0.399542    2.76369e-07 1.42599e-07 1.4316e-07
bonsai/test/ec/GCF_000005845.2_ASM584v2_genomic.fna.gz  -   -   -   -   -   0.391206    0   0   1.43015e-07
bonsai/test/ec/GCF_000009565.1_ASM956v1_genomic.fna.gz  -   -   -   -   -   -   1.39716e-07 0   1.44805e-07
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   -   -   -   -   -   -   -   0.00571696  0.00364102
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  -   -   -   -   -   -   -   -   0.196344
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   -   -   -   -   -   -   -   -   -

You're right that the weighted cardinality should be higher than the unweighted cardinality. I think part of the issue is that you're using --wj instead of -wj-exact. The --wj method can get out of sync when generating weighted sketches, and so it tends to reduce estimated similarities. --wj-exact comes at a bit of runtime cost, but it shouldn't have that problem, and it's probably the better flag to use for most applications.

I'm not sure why the cardinalities are ending up zero.. Are these assemblies compressed with something other than gzip? I think Dashing might have trouble reading bz2 and xz compressed files.

Thanks for making the issue, and I'm happy to help more.

Daniel