broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.
MIT License
76 stars 16 forks source link

Cluster input sequences by producing signatures of them #24

Closed haydenm closed 5 years ago

haydenm commented 5 years ago

Sometimes when the input data to design.py includes large numbers of divergent sequence (e.g., all sequences for all eight segments of Influenza A virus), solving the set cover instance (here) can be slow. One solution is to run design.py separately on clusters that can be easily identified (e.g., each run contains all sequences for one segment of Influenza A virus), and then pool the results either on the same choice of parameter values or from a search.

In other cases, a user may want to input all divergent sequences to design.py. For cases like this, there is no need to solve a single set cover instance across all sequences. Instead, we can cluster the input sequences and solve a separate instance on each cluster, and then pool the resulting probes. This may slightly raise the size of the final output (if there is some homology and shared probe sequences between the clusters), but could dramatically improve runtime.

Ideally the clustering should be alignment-free. One option is to produce a signature (or "sketch") of each input sequence using MinHash (as done in Mash) or HyperLogLog (as done in Dashing). Then, the sequences can be clustered using these signatures.

Relatedly, an option to reduce runtime on long genomes (even in cases where the input sequences are not completely divergent) is to chop across the sequences, cluster these fragments, and solve a separate instance on each cluster.

haydenm commented 5 years ago

This is implemented in #25.