Issue #24 gives background and reasons for this feature. In short, design.py can be slow when given a large number of highly divergent sequences (e.g., all sequences for all eight segments of influenza A virus). One solution is to cluster input sequences (alignment-free), solve a separate set cover instance on each cluster, and then merge the output probes from each cluster.
This PR adds the argument --cluster-and-design-separately. When provided, design.py produces a signature (or "sketch") of each input sequence using MinHash (similar to what is done in Mash) and clusters the sequences by comparing their signatures. Clustering itself can be slow and memory-intensive, but signatures make pairwise comparison of sequences fast. Then, independently on each cluster, it generates candidate probes and runs a collection of filters on them (typically including set_cover_filter). Finally, it merges the resulting probes (removing exact duplicates) and runs final filters (e.g., adapter_filter) on the merged set of probes.
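To make the signature-and-cluster step concrete, here is a minimal sketch of the idea, not the code in this PR. It builds a bottom-k MinHash signature per sequence, estimates Jaccard similarity from two signatures, and groups sequences with single-linkage clustering. The function names, the k-mer size, the signature size, the distance threshold, and the choice of single-linkage are all assumptions made for illustration.

```python
import hashlib


def kmer_hashes(seq, k):
    """Yield a deterministic 64-bit hash for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big")


def minhash_signature(seq, k=10, size=50):
    """Bottom-k sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted(set(kmer_hashes(seq, k)))[:size]


def estimated_jaccard(sig_a, sig_b, size=50):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    union_bottom = sorted(set(sig_a) | set(sig_b))[:size]
    if not union_bottom:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)


def cluster_by_signature(seqs, max_dist=0.5, k=10, size=50):
    """Single-linkage clustering (assumed here) on distance 1 - Jaccard.

    Returns a sorted list of clusters, each a list of indices into seqs.
    """
    sigs = [minhash_signature(s, k, size) for s in seqs]
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if 1.0 - estimated_jaccard(sigs[i], sigs[j], size) <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Only the fixed-size signatures are compared in the pairwise loop, so each comparison costs roughly the signature size rather than the sequence length. The loop is still quadratic in the number of sequences, which is consistent with the note above that clustering itself can be slow.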
Provided that clustering itself is not too costly, this generally improves overall runtime and memory usage, because solving several independent, smaller set cover instances requires fewer resources than solving the complete one. One downside is that it can increase the size of the resulting probe set (e.g., if there is homology between input sequences that are placed into different clusters, since probes for each cluster are designed without regard to the others).
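The per-cluster design and merge can be pictured with the following toy sketch. It is an illustration only: `design_separately` and `tile_probes` are hypothetical names, and simple tiling stands in for the real candidate generation and set cover filtering, which this sketch does not attempt to reproduce.

```python
def tile_probes(cluster, probe_len=4):
    """Toy stand-in for per-cluster design: tile each sequence into probes."""
    probes = []
    for seq in cluster:
        for i in range(0, len(seq) - probe_len + 1, probe_len):
            probes.append(seq[i:i + probe_len])
    return probes


def design_separately(clusters, design_cluster, final_filters=()):
    """Design on each cluster independently, then merge and post-filter.

    clusters: list of clusters, each a list of sequences.
    design_cluster: callable mapping one cluster to a list of probes.
    final_filters: callables applied in order to the merged probe list.
    """
    merged = []
    seen = set()
    for cluster in clusters:
        for probe in design_cluster(cluster):
            # Remove exact duplicates while preserving first-seen order.
            if probe not in seen:
                seen.add(probe)
                merged.append(probe)
    for probe_filter in final_filters:
        merged = probe_filter(merged)
    return merged


# Two clusters that share the homologous prefix "AAAA".
clusters = [["AAAACCCC"], ["AAAAGGGG"]]
probes = design_separately(clusters, tile_probes)
```

In this toy case the shared probe `AAAA` is produced by both clusters and collapses to one copy on merge. Near-identical (but not exactly equal) probes from different clusters would both survive, which is the probe-set growth described above.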
This PR also adds the argument --cluster-from-fragments to improve runtime and memory usage on long genomes. When set, this chops input sequences according to some specified fragment length, clusters the fragments, and designs probes independently on each cluster of fragments. The effect might be, for example, that each cluster consists of a gene (or fixed-length piece of a gene). This can help when the input consists of relatively long genomes (e.g., bacterial or DNA virus) because probes for different parts of the genome can be designed, to some extent, independently.
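As a rough illustration of the chopping step, the sketch below splits a sequence into consecutive, non-overlapping fragments. The function name is hypothetical, and both the lack of overlap and the decision to keep a shorter final fragment are assumptions; the PR text does not say how the implementation handles either.

```python
def chop_into_fragments(seq, fragment_length):
    """Split seq into consecutive, non-overlapping fragments.

    The last fragment may be shorter than fragment_length (an assumption
    of this sketch, not necessarily what design.py does).
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return [seq[i:i + fragment_length]
            for i in range(0, len(seq), fragment_length)]


fragments = chop_into_fragments("ACGTACGTAC", 4)
```

The resulting fragments, rather than whole genomes, would then be the items that are sketched and clustered.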
Coverage increased (+0.02%) to 95.127% when pulling c6fd1f3d3bef002e411ceb1f9e8c9219d3ae7cab on cluster-genomes into 6f448eb775e28361184cf3df81ed2b6899e863ae on master.
This PR addresses #24.