Closed matnguyen closed 1 year ago
Thanks for the file! I'll give it a shot.
How many sequences are in the file? Once you go past 100k, the default all-pairs distance calculation gets pretty expensive because of the n-choose-2, i.e. O(n^2), pairwise comparisons that are performed.
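To put rough numbers on that quadratic growth, here is a minimal Python sketch. It only counts n-choose-2 pairs; `pair_count` is a name made up for illustration, and nothing here reflects how dashing2 actually schedules or batches its comparisons.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered pairs among n sequences (n choose 2)."""
    return comb(n, 2)


# Illustrative sizes: 65k and 100k come from this thread, 1M is just for scale.
for n in (65_000, 100_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pair_count(n):,} pairwise comparisons")
```

Going from 65k to 100k sequences roughly doubles the number of comparisons even though the input grows by only about half, which is why the all-pairs step dominates at larger sizes.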
It's true - if you're performing sketch in --parse-by-seq mode, it doesn't have anywhere to put the sketches by default. You can have it write them to disk by adding -o sketch_file.bin. With 65k sequences, though, I would expect it to run faster.
I notice that --cmpout is a folder path; it's possible that it's stuck trying to write to a directory.
I'll update you with progress. And I appreciate the smaller xz file; it's the format I usually share files in.
Found the bug: my recent attempt to improve memory usage introduced a memory management error.
I've fixed it in https://github.com/dnbaker/dashing2/pull/83, and the release is tagged as v2.1.19. You can build from source, or wait until later today when I have new binaries out.
Sorry for the wait!
Binaries are up. Let me know if you have any further issues - thanks!
Works great now, thanks!
Sketching the following FASTA file never completes (it has been running for 5 days). The following command was used:
dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch --cmpout ../ab_data/ -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa
ab.fa.xz.gz
If I only do sketching, it completes but does not output any sketch file:
dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa --cache --prefix ../ab_data/
Sorry about the weird file format; xz is better for compression, but GitHub won't let me upload .xz files directly.