dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

README clarification & performance #49

Closed rchikhi closed 4 years ago

rchikhi commented 4 years ago

Hi again, sorry to only be leaving 'negative' issues, but my dashing run has been going on for a day (2 threads) so I suppose I did something wrong when following the README instructions. I have 15k datasets, each dataset is neither a complete genome nor a set of reads but a .fa.gz human transcriptome assembly.

  1. I went with the first cmdline of the README: dashing dist -k31 -p2. Is this an appropriate one for my setting?
  2. README says "unspaced, unminimized kmers". But these are still minimizers, right?
  3. If I understand correctly, the "dist" section is for all-against-all comparison, and the "dist (asymmetric mode)" section is all A's versus all B's comparison, right?

thanks in advance, Rayan

dnbaker commented 4 years ago

Hi Rayan,

Thanks again for checking.

  1. dashing dist -k31 -p2 will compute distances with k = 31 and 2 threads, which should be fine. Can I see the full command line involved?
  2. Minimized means that only the minimizers from sliding windows are sketched. Technically, unminimized k-mers are minimizers for windows containing a single k-mer.
  3. Correct, dist (by default) is all pairwise comparisons within a set. The asymmetric mode is enabled by both -F and -Q, e.g., dashing dist <args> -F path_to_A_genomes.txt -Q path_to_B_genomes.txt. The asymmetric measures (such as containment) are only available in asymmetric mode.
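To make the symmetric/asymmetric distinction in point 3 concrete, here is a minimal sketch using exact k-mer sets. It is not how dashing computes these values (dashing estimates them from HyperLogLog sketches); the function names and the toy sequences are hypothetical and only illustrate why containment needs an ordered pair (A versus B) while Jaccard does not.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (no canonicalization, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Symmetric measure: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Asymmetric measure: fraction of A's k-mers that also occur in B."""
    return len(a & b) / len(a) if a else 0.0

# Toy data (hypothetical): A is a prefix of B, so A is fully contained in B.
A = kmers("ACGTACGTGG", 4)
B = kmers("ACGTACGTGGTTCCAA", 4)

print(jaccard(A, B), jaccard(B, A))          # same value in both directions
print(containment(A, B), containment(B, A))  # differs by direction
```

Because containment(A, B) and containment(B, A) generally differ, a measure like this only makes sense when the two input sets are kept distinct, which is what the -F/-Q split provides.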

This might be the wrong direction to look, but I wonder if maybe it's hanging due to how arguments are passed. Input files can be specified as positional arguments (dashing <opts> g1.fa g2.fa g3.fa ...) or with the -F flag (dashing <opts> -F paths.txt).

On the other hand, it's possible that the transcriptome assemblies are rather large and sketching is just taking a while with only 2 threads. If you re-run dist with the -W/--cache-sketches argument, dashing will cache sketches to files adjacent to the input files, which you could use to monitor progress. (e.g., dashing dist -k17 might leave a cached file of GCF_000762265.1_ASM76226v1_genomic.fna.gz.w.17.spacing.10.hll for GCF_000762265.1_ASM76226v1_genomic.fna.gz.)
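One way to monitor progress from the cached files could look like the sketch below. It assumes the list passed to -F has one input path per line, and that each cached sketch sits next to its input with a name of the form <input>.<parameters>.hll, as in the example filename above; the helper name sketch_progress is hypothetical and not part of dashing.

```python
import glob

def sketch_progress(paths_file, suffix=".hll"):
    """Count how many inputs listed in paths_file already have a cached sketch."""
    with open(paths_file) as fh:
        inputs = [line.strip() for line in fh if line.strip()]
    done = 0
    for path in inputs:
        # Assumed naming scheme: "<input>.<parameters>.hll" beside the input file.
        # glob.escape protects any glob metacharacters in the path itself.
        if glob.glob(glob.escape(path) + ".*" + suffix):
            done += 1
    return done, len(inputs)
```

Calling sketch_progress("paths.txt") from the directory where dashing was launched (so that relative paths resolve the same way) returns a (done, total) pair that can be polled while the run is in progress.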

My last thought is that strandedness may matter for your application (transcriptomes). By default (with genomes as a primary application), k-mers are canonicalized, but for RNA, you may want to use -C/--no-canon.
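As an illustration of what canonicalization does, the sketch below uses one common convention: a k-mer and its reverse complement are both mapped to whichever of the two is lexicographically smaller. Whether dashing uses this exact ordering internally is an assumption here, and the helper names are hypothetical; the point is only that canonical k-mer sets are strand-independent while non-canonical ones are not.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Smaller of the k-mer and its reverse complement (assumed convention)."""
    return min(kmer, revcomp(kmer))

def kmer_set(seq, k, canon=True):
    """All k-mers of seq, optionally canonicalized."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out.add(canonical(kmer) if canon else kmer)
    return out

fwd = "ACGGTCA"
rev = revcomp(fwd)  # the same molecule read from the opposite strand

print(kmer_set(fwd, 3) == kmer_set(rev, 3))                            # canonical sets agree
print(kmer_set(fwd, 3, canon=False) & kmer_set(rev, 3, canon=False))   # shared k-mers without canonicalization
```

For stranded RNA data the two strands are not interchangeable, which is the situation where turning canonicalization off may be preferable.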

Does this help?

Thanks!

Daniel

rchikhi commented 4 years ago

Thanks for the comprehensive answer. Sure, the full cmdline is \time dashing dist -k31 -p2 -Odistance_matrix.txt -osize_estimates.txt -F list_unitigs, with list_unitigs being:

../data/DRR017562.unitigs.fa.gz
../data/DRR017563.unitigs.fa.gz
../data/DRR017566.unitigs.fa.gz
../data/DRR017598.unitigs.fa.gz
...

Dashing isn't hanging; it is currently running at 190% CPU, as expected. The files are all rather large indeed: around 300 Mbp, or even up to 1-5 GB per file. That could be it! Maybe it's still in the sketching phase. I'll stop and re-run with -W. Also, the unitigs are unstranded, so I'll leave it without -C.

rchikhi commented 4 years ago

Also regarding point 2: oh then I understand. Dashing doesn't use minimizers (by default).
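The relationship between minimizers and plain k-mers can be checked with a small sketch. It assumes lexicographic ordering and a window measured in number of k-mers; real tools usually order k-mers by hash value and may define the window differently, so this is an illustration rather than dashing's implementation. The function name and test sequence are hypothetical.

```python
def minimizers(seq, k, w):
    """Smallest k-mer (lexicographic order) of every window of w consecutive k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}

seq = "GATTACAGATTACC"
all_kmers = {seq[i:i + 4] for i in range(len(seq) - 3)}

# With a window of a single k-mer, every k-mer is its own minimizer.
print(minimizers(seq, 4, 1) == all_kmers)

# A wider window keeps only a subset of the k-mers.
print(sorted(minimizers(seq, 4, 5)))
```

With w = 1 nothing is discarded, which matches the "unminimized" default; larger windows trade completeness for a smaller set of k-mers to sketch.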

rchikhi commented 4 years ago

OK, I could see that in 4 hours it created the sketches for 1% of my dataset, so my dataset is probably too big :) Thanks for the help.
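A rough extrapolation from these numbers, assuming sketching time scales linearly with the number of files and taking "15k" as exactly 15,000 datasets (both are assumptions, and file sizes vary widely here):

```python
hours_elapsed = 4.0     # observed wall-clock time with 2 threads
fraction_done = 0.01    # about 1% of the inputs had cached sketches
n_datasets = 15_000     # "15k" taken as exactly 15,000 (assumption)

# Linear extrapolation of the sketching phase only.
est_sketch_hours = hours_elapsed / fraction_done
est_sketch_days = est_sketch_hours / 24

# Number of pairwise comparisons the all-against-all phase would then need.
n_pairs = n_datasets * (n_datasets - 1) // 2

print(f"sketching: ~{est_sketch_hours:.0f} h (~{est_sketch_days:.1f} days)")
print(f"pairwise comparisons afterwards: {n_pairs:,}")
```

Under these assumptions the sketching phase alone would take on the order of weeks with 2 threads, before any distances are computed.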

rchikhi commented 4 years ago

I tried it on a smaller set of just 10 datasets and it finished in 2 minutes, all good! 👍

dnbaker commented 4 years ago

Great!

Out of curiosity, how large are these assemblies? Edit: never mind, I saw your answer above.