Sketching/comparing never completes on specific Fasta file

dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.

MIT License


Sketching/comparing never completes on specific Fasta file #82

Closed (matnguyen closed this issue 1 year ago)

matnguyen commented 1 year ago

Sketching and comparing the following FASTA file never completes (it has been running for 5 days). I used this command: dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch --cmpout ../ab_data/ -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa

ab.fa.xz.gz

If I only sketch, the run completes but does not output any sketch file: dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa --cache --prefix ../ab_data/

Sorry about the odd file format (.xz.gz): xz compresses better, but GitHub won't let me upload .xz files directly.

dnbaker commented 1 year ago

Thanks for the file! I'll give it a shot.

How many sequences are in the file? Once you go past 100k, the default all-pairs distance calculation gets pretty expensive, because it performs n-choose-2 pairwise comparisons, which is O(n^2).
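To put the quadratic growth in concrete terms, here is a minimal Python sketch that counts the unordered pairs an all-pairs comparison would visit. The function name `pairwise_comparisons` is made up for illustration, and the 65,000 figure is a round number taken from the "65k sequences" mentioned in this thread, not an exact count from the file.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n choose 2 = n * (n - 1) / 2."""
    return comb(n, 2)


# Illustrative sizes only; 65_000 approximates the "65k sequences" in this thread.
for n in (1_000, 65_000, 100_000):
    print(f"{n:>7,} sequences -> {pairwise_comparisons(n):>13,} comparisons")
```

At roughly 65k sequences this comes to about 2.1 billion comparisons, and at 100k to about 5 billion. That is a lot of work, but not obviously five days' worth.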

You're right that the sketch-only run writes nothing - in --parse-by-seq mode, sketch doesn't have anywhere to put the sketches by default.

You can have it write them to disk by adding -o sketch_file.bin. With 65k sequences, though, I would expect it to run faster.

I notice that --cmpout is a folder path; it's possible that it's stuck trying to write to a directory.
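One way to rule out the directory-path question is to assemble the command with explicit file paths for both outputs and refuse any output path that is a directory. The sketch below is an assumption-laden illustration, not part of dashing2. The helper `build_sketch_command` and the output file names are hypothetical, only flags that appear in this thread are used (`-k`, `-p`, `--parse-by-seq`, `-o`, `--cmpout`), and it is assumed here that `-o` and `--cmpout` can be combined in a single run.

```python
from pathlib import Path


def build_sketch_command(binary, fasta, sketch_out, cmp_out, k=7, threads=8):
    """Assemble a `dashing2 sketch` argument list, rejecting directory output paths.

    Hypothetical helper: it only builds the argument list and does not run anything.
    """
    for flag, path in (("-o", sketch_out), ("--cmpout", cmp_out)):
        if Path(path).is_dir():
            raise ValueError(f"{flag} should name a file, but {path!r} is a directory")
    return [
        str(binary), "sketch",
        "-k", str(k),
        "--parse-by-seq",
        "-p", str(threads),
        "-o", str(sketch_out),      # where the sketches are written
        "--cmpout", str(cmp_out),   # where the pairwise comparisons are written
        str(fasta),
    ]


if __name__ == "__main__":
    # Paths mirror the ones in this thread; the two output file names are made up.
    cmd = build_sketch_command(
        "dashing2_binaries/linux/v2.1.16/dashing2_savx2",
        "../ab_data/ab.fa",
        "../ab_data/ab.sketches.bin",
        "../ab_data/ab.distances.txt",
    )
    print(" ".join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` if the binary is available.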

I'll update you with progress. And I appreciate the smaller xz file; it's the format I usually share files in.

dnbaker commented 1 year ago

Found the bug: my recent attempt to improve memory usage introduced a memory management error.

I've fixed it in https://github.com/dnbaker/dashing2/pull/83, and the release is tagged as v2.1.19. You can build from source, or wait until later today when I have new binaries out.

Sorry for the wait!

dnbaker commented 1 year ago

Binaries are up. Let me know if you have any further issues - thanks!

matnguyen commented 1 year ago

Works great now, thanks!