Closed rchikhi closed 4 years ago
Hi Rayan,
Thanks again for checking.
2) Unminimized k-mers are minimizers for sets of size 1.
3) dashing dist <args> -F path_to_A_genomes.txt -Q path_to_B_genomes.txt. The asymmetric measures (such as containment) are only available in asymmetric mode.
This might be the wrong direction to look, but I wonder if it's hanging because of how the arguments are passed. Input paths can be specified as positional arguments (dashing <opts> g1.fa g2.fa g3.fa ...) or with the -F flag (dashing <opts> -F paths.txt).
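To illustrate why a measure such as containment only makes sense when the two sides are distinguished, here is a minimal sketch on exact k-mer sets. It is an illustration only: dashing estimates these quantities from sketches rather than exact sets, and the function names and toy sequences are made up for the example.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of seq (no canonicalization, exact set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Symmetric: |A & B| / |A | B|. Swapping a and b gives the same value."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Asymmetric: fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical toy inputs: `small` is a prefix of `large`.
small = kmers("ACGTACGT", 3)
large = kmers("ACGTACGTTTGACCA", 3)

print(jaccard(small, large), jaccard(large, small))          # same value twice
print(containment(small, large), containment(large, small))  # differ
```

Because `small` is fully contained in `large`, containment is 1.0 in one direction and lower in the other, while Jaccard is the same either way.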
On the other hand, it's possible that the transcriptome assemblies are rather large and sketching is just taking a while with only 2 threads. If you re-run dist with the -W/--cache-sketches argument, dashing will cache sketches to files adjacent to the input files, which you could use to monitor progress (e.g., dashing dist -k17 might leave a cached file named GCF_000762265.1_ASM76226v1_genomic.fna.gz.w.17.spacing.10.hll for GCF_000762265.1_ASM76226v1_genomic.fna.gz).
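One way to use the cached files for monitoring is to count how many inputs in the path list already have a sketch next to them. The script below is a rough sketch, not part of dashing: it assumes the cached file name is the input path followed by a suffix ending in .hll, which is inferred only from the single example above, and `sketch_progress` is a made-up helper name.

```python
import glob
import sys
from pathlib import Path


def sketch_progress(list_file: str) -> tuple:
    """Return (inputs with a cached sketch, total inputs) for a -F style path list.

    Assumption: a cached sketch is named '<input path>.<something>.hll',
    as in the example file name quoted above.
    """
    lines = Path(list_file).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    # glob.escape protects paths that happen to contain *, ? or [.
    done = sum(1 for p in paths if glob.glob(glob.escape(p) + ".*.hll"))
    return done, len(paths)


if __name__ == "__main__" and len(sys.argv) > 1:
    done, total = sketch_progress(sys.argv[1])
    print(f"{done}/{total} inputs have a cached sketch")
```

Run it from the same directory as the dashing command, so that relative paths in the list resolve the same way.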
My last thought is that strandedness may matter for your application (transcriptomes). By default (with genomes as the primary application), k-mers are canonicalized, but for RNA you may want to use -C/--no-canon.
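For readers unfamiliar with canonicalization, the toy example below shows what it changes. It uses the common convention of keeping the lexicographically smaller of a k-mer and its reverse complement; this is an assumption for illustration, since dashing works on encoded k-mers and its exact ordering may differ. All names here are made up for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Strand-independent representative: the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_set(seq: str, k: int, canon: bool = True) -> set:
    """Distinct k-mers of seq, canonicalized unless canon=False."""
    raw = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {canonical(km) for km in raw} if canon else set(raw)


forward = "AAACG"
reverse = reverse_complement(forward)  # the same molecule read from the other strand

# Canonical: both strands give the same k-mer set.
print(kmer_set(forward, 3) == kmer_set(reverse, 3))
# Stranded (canon=False): the two strands share no k-mers in this example.
print(kmer_set(forward, 3, canon=False) & kmer_set(reverse, 3, canon=False))
```

With canonical k-mers, a sequence and its reverse complement look identical, which suits unstranded data; with canonicalization off, the strand is preserved.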
Does this help?
Thanks!
Daniel
Thanks for the comprehensive answer. Sure, the full cmdline is:
\time dashing dist -k31 -p2 -Odistance_matrix.txt -osize_estimates.txt -F list_unitigs
With list_unitigs being:
../data/DRR017562.unitigs.fa.gz
../data/DRR017563.unitigs.fa.gz
../data/DRR017566.unitigs.fa.gz
../data/DRR017598.unitigs.fa.gz
...
Dashing isn't hanging; it is currently running at 190% CPU, as expected.
They're all rather large indeed: around 300 Mbp, or even up to 1-5 GB, per file. That could be it! Maybe it's still at the sketching phase. I'll stop and re-run with -W. Also, the unitigs are unstranded, so I'll leave it without -C.
Also regarding point 2: oh then I understand. Dashing doesn't use minimizers (by default).
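The "sets of size 1" remark can be checked on a toy example: a minimizer scheme that picks the smallest k-mer from every window of w consecutive k-mers keeps every k-mer when w = 1. The sketch below orders k-mers lexicographically for simplicity, which is an assumption; real tools typically order by a hash, and the function names are made up for the example.

```python
def kmers(seq: str, k: int) -> list:
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq: str, k: int, w: int) -> set:
    """Smallest k-mer (lexicographic order) from each window of w consecutive k-mers."""
    ks = kmers(seq, k)
    return {min(ks[i:i + w]) for i in range(len(ks) - w + 1)}


seq = "GATTACAGATTACC"

# Window of one k-mer: each k-mer is trivially the minimum of its own window.
print(minimizers(seq, 3, 1) == set(kmers(seq, 3)))
# Wider window: only a subset of the k-mers is kept.
print(sorted(minimizers(seq, 3, 4)))
```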
OK, I could see that in 4 hours it created the sketches for 1% of my dataset, so my dataset is probably too big :) Thanks for the help.
I tried it on a smaller set of just 10 datasets and it finished in 2 minutes, all good! 👍
Great!
Out of curiosity, how large are these assemblies? Edit: never mind, I saw your answer above.
Hi again, sorry to only be leaving 'negative' issues, but my dashing run has been going on for a day (2 threads), so I suppose I did something wrong when following the README instructions. I have 15k datasets; each dataset is neither a complete genome nor a set of reads but a .fa.gz human transcriptome assembly.
1) I went with the first cmdline of the README: dashing dist -k31 -p2. Is this an appropriate one for my setting?
2) The README says "unspaced, unminimized kmers". But these are still minimizers, right?
3) If I understand correctly, the "dist" section is for all-against-all comparison, and the "dist (asymmetric mode)" section is for all A's versus all B's comparison, right?
Thanks in advance,
Rayan