dnbaker / dashing2

Dashing 2 is a fast toolkit for k-mer and minimizer encoding, sketching, comparison, and indexing.
MIT License

Segmentation fault error #77

Open kaysahu opened 1 year ago

kaysahu commented 1 year ago

Hi, I ran dashing2 on two different files:

1) Genome file: I used the basic dashing2 sketch command, which took nearly 2 days to run.

$repo/dashing2 sketch --parse-by-seq --cmpout $outfile $genome

Can you suggest a way to reduce the runtime?

2) Protein file: This input file was converted from the genome file in (1). After 96 hours of processing, I got this error:

[image: screenshot of the error, not included]

Can you please advise regarding this error?

dnbaker commented 1 year ago

Hi -

I suggest generating a sparse distance matrix when the number of entries is greater than about 50,000.

You can make it sparse by choosing a minimum similarity (e.g., --similarity-threshold 0.8) or top-k (e.g., --topk 250). Dashing2 then indexes the data and only performs comparisons against near neighbors retrieved from the index.
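
To make the two sparsification modes concrete, here is a minimal Python sketch of what "minimum similarity" and "top-k" filtering mean for one row of the comparison matrix. It is not dashing2's implementation: the function `sparsify` and the toy matrix `S` are hypothetical, and this brute-force version still scores every pair, whereas the index-based approach described above avoids that.

```python
import heapq

def sparsify(sim, n, threshold=None, topk=None):
    """Return {i: [(similarity, j), ...]} keeping only selected neighbors.

    sim is a hypothetical callable sim(i, j) -> float; n is the item count.
    threshold keeps neighbors with similarity >= threshold;
    topk keeps at most the topk most similar neighbors per item.
    """
    rows = {}
    for i in range(n):
        # Brute force for illustration only: score i against every other item.
        cands = [(sim(i, j), j) for j in range(n) if j != i]
        if threshold is not None:
            cands = [c for c in cands if c[0] >= threshold]
        if topk is not None:
            cands = heapq.nlargest(topk, cands)
        rows[i] = sorted(cands, reverse=True)
    return rows

# Toy symmetric similarity matrix (made-up values).
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

by_threshold = sparsify(lambda i, j: S[i][j], 3, threshold=0.8)
by_topk = sparsify(lambda i, j: S[i][j], 3, topk=1)
print(by_threshold)
print(by_topk)
```

With a threshold, an item can end up with no neighbors at all (item 2 here); with top-k, every item keeps some neighbors even if they are not very similar.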

You can parse the output files yourself, or you can choose binary output and use parsing code from dashing2/python/parse.py. Either way, once you have the sparse matrix, you can feed it into HDBSCAN to cluster quickly.
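
As a rough sketch of the step between parsing and clustering, the Python below turns already-parsed sparse similarities into symmetric distance entries in coordinate (COO) form. Several things here are assumptions, not dashing2 behavior: the input is taken to be `(i, j, similarity)` triples (the real output layout and `parse.py` interface are not shown), the function name `to_sparse_distances` is hypothetical, and distance is taken as `1 - similarity`.

```python
def to_sparse_distances(triples, n):
    """Convert (i, j, similarity) triples into symmetric COO distance arrays.

    triples: hypothetical, already-parsed sparse similarities with 0 <= i, j < n.
    Returns (rows, cols, vals) with each retained pair stored in both directions.
    """
    dist = {}
    for i, j, s in triples:
        if not (0 <= i < n and 0 <= j < n):
            raise ValueError(f"index out of range: ({i}, {j})")
        if i == j:
            continue  # skip self-comparisons
        d = 1.0 - s  # assumption: distance = 1 - similarity
        key = (min(i, j), max(i, j))
        # If a pair is reported from both sides, keep the smaller distance.
        if key not in dist or d < dist[key]:
            dist[key] = d
    rows, cols, vals = [], [], []
    for (i, j), d in sorted(dist.items()):
        rows += [i, j]
        cols += [j, i]
        vals += [d, d]
    return rows, cols, vals

# Toy input (made-up values): pair (0, 1) appears from both sides.
rows, cols, vals = to_sparse_distances(
    [(0, 1, 0.75), (1, 0, 0.75), (2, 1, 0.5)], n=3
)
print(rows, cols, vals)
```

If SciPy is available, these arrays can be wrapped as `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))` and handed to a clustering library that accepts precomputed sparse distances. Check that library's documentation for how it treats missing entries and disconnected components.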

The alternative is to cluster directly in dashing2 using --greedy <similarity_threshold>, which puts all sequences above the threshold into the same cluster and only compares new sequences against the largest sequence in each cluster. It's quick, though the clusters are not as high quality as those HDBSCAN can generate.
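
The greedy idea can be sketched in a few lines of Python. This is an illustration of the general technique, not dashing2's code: `greedy_cluster`, `kmers`, and `jaccard` are hypothetical helpers, exact Jaccard similarity on k-mer sets stands in for sketch-based similarity, and "largest sequence" is read here as the longest member of the cluster.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets (stand-in for a sketch estimate)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def greedy_cluster(seqs, threshold, k=3):
    """Assign each sequence to the first cluster whose representative is similar enough.

    Each cluster tracks one representative (its longest member so far), so a new
    sequence is compared against one sequence per cluster, not against every member.
    """
    clusters = []  # each entry: {"rep": index of representative, "members": [indices]}
    for idx, seq in enumerate(seqs):
        ks = kmers(seq, k)
        for c in clusters:
            if jaccard(ks, kmers(seqs[c["rep"]], k)) >= threshold:
                c["members"].append(idx)
                if len(seq) > len(seqs[c["rep"]]):
                    c["rep"] = idx  # a longer sequence becomes the representative
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append({"rep": idx, "members": [idx]})
    return [c["members"] for c in clusters]

# Toy sequences (made up for illustration).
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
print(greedy_cluster(seqs, threshold=0.5))
```

Because each sequence joins the first cluster that passes the threshold and is never revisited, the result depends on input order, which is one reason the output can be lower quality than a density-based method.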

For genomes, all-pairs is good enough for a lot of applications, but at the level of reads or transcripts you often need to use the sparse computation modes.

Happy to answer more questions, and good luck!

Daniel