Cluster input sequences by producing signatures of them

Sometimes when the input data to design.py includes large numbers of divergent sequence (e.g., all sequences for all eight segments of Influenza A virus), solving the set cover instance (here) can be slow. One solution is to run design.py separately on clusters that can be easily identified (e.g., each run contains all sequences for one segment of Influenza A virus), and then pool the results either on the same choice of parameter values or from a search.

In other cases, a user may want to input all divergent sequences to design.py. For cases like this, there is no need to solve a single set cover instance across all sequences. Instead, we can cluster the input sequences and solve a separate instance on each cluster, and then pool the resulting probes. This may slightly raise the size of the final output (if there is some homology and shared probe sequences between the clusters), but could dramatically improve runtime.

Ideally the clustering should be alignment-free. One option is to produce a signature (or "sketch") of each input sequence using MinHash (as done in Mash) or HyperLogLog (as done in Dashing). Then, the sequences can be clustered using these signatures.

Relatedly, an option to reduce runtime on long genomes (even in cases where the input sequences are not completely divergent) is to chop across the sequences, cluster these fragments, and solve a separate instance on each cluster.

broadinstitute / catch

Cluster input sequences by producing signatures of them #24