Segmentation fault error

Hi -

I suggest generating a sparse distance matrix when the number of entries is greater than about 50,000.

You can make it sparse by choosing a minimum similarity (e.g., --similarity-threshold 0.8) or top-k (e.g., --topk 250). Dashing2 then indexes the data and only performs comparisons against near neighbors retrieved from the index.

You can parse the output files yourself, or you can choose binary output and use the parsing code in dashing2/python/parse.py. Either way, once you have the sparse matrix, you can feed it into HDBSCAN to cluster quickly.
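As a rough illustration of the step between parsing and clustering, here is a minimal sketch that turns parsed neighbor records into a symmetric sparse distance matrix in coordinate form. It assumes you have already parsed the output into `(i, j, similarity)` triples and that `1 - similarity` is an acceptable distance. The function name and the triple layout are hypothetical, not something dashing2 or parse.py provides.

```python
def to_sparse_distances(triples):
    """Convert (i, j, similarity) triples into symmetric COO-style lists.

    Assumption: distance = 1 - similarity. Pairs missing from the input are
    simply absent from the result (treated as "not neighbors").
    """
    best = {}
    for i, j, sim in triples:
        if i == j:
            continue  # skip self-comparisons
        key = (min(i, j), max(i, j))
        dist = 1.0 - sim
        # Top-k output can report a pair from both sides; keep the smaller distance.
        if key not in best or dist < best[key]:
            best[key] = dist
    rows, cols, vals = [], [], []
    for (i, j), dist in sorted(best.items()):
        # Emit both (i, j) and (j, i) so the matrix is symmetric.
        rows += [i, j]
        cols += [j, i]
        vals += [dist, dist]
    return rows, cols, vals


rows, cols, vals = to_sparse_distances([(0, 1, 0.9), (1, 0, 0.95), (1, 2, 0.8)])
```

The three lists can be passed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))`, and the resulting matrix handed to an HDBSCAN implementation with a precomputed metric. Whether a particular sparse graph is accepted depends on the implementation and on how well connected the neighbor graph is, so treat this as a starting point.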

The alternative is to cluster directly in dashing2 using --greedy <similarity_threshold>, which puts all sequences above the threshold into the same cluster and only compares each new sequence against the largest sequence in each cluster. It's quick, though the clusters are not as high quality as those HDBSCAN can generate.
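To make the greedy idea concrete, here is a small sketch of that style of clustering. It is not dashing2's implementation: it uses exact Jaccard similarity over k-mer sets as a stand-in for sketch-based similarity, and all names (`kmers`, `jaccard`, `greedy_cluster`) are hypothetical.

```python
def kmers(seq, k=4):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, threshold, k=4):
    """Assign each sequence to the first cluster whose representative is
    at least `threshold` similar; otherwise start a new cluster.
    The representative is the longest sequence seen so far in the cluster."""
    sets = [kmers(s, k) for s in seqs]
    clusters = []  # each entry: {"rep": index, "members": [indices]}
    for i, s in enumerate(seqs):
        for c in clusters:
            # Only one comparison per cluster: against its representative.
            if jaccard(sets[i], sets[c["rep"]]) >= threshold:
                c["members"].append(i)
                if len(s) > len(seqs[c["rep"]]):
                    c["rep"] = i  # keep the longest member as representative
                break
        else:
            clusters.append({"rep": i, "members": [i]})
    return [c["members"] for c in clusters]


result = greedy_cluster(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"], threshold=0.8)
```

Each sequence costs one comparison per existing cluster rather than one per sequence, which is where the speed comes from. The trade-off is that the result depends on input order and on the choice of representative.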

For genomes, all-pairs is good enough for a lot of applications, but at the level of reads or transcripts you often need to use the sparse computation modes.

Happy to answer more questions, and good luck!

Daniel

dnbaker / dashing2

Segmentation fault error #77