Previously, a user could request a desired partial cover across all input sequences with -p/--cover-frac (e.g., ensure that the designed guides hit 90% of all input sequences). These changes allow variable coverage of the input sequences according to year. The guides can then be designed so as to ensure higher coverage of more recent years; this can help reduce the number of overall guides (by requesting less coverage of sequences from more distant years) and/or guarantee higher coverage of sequences from more recent years.
This pull request makes several changes to support the above, including:
Generalizes the set cover solver to accept a universe divided into groups. Here, each group represents the sequences from one year. Each group can have its own desired partial cover: the fraction of sequences in that group that the guides are guaranteed to hit. The sets (guides) are optimally selected from the entire universe so as to guarantee the desired coverage of each group.
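As a rough illustration of the grouped formulation, here is a minimal greedy sketch. It is not the code in this pull request: the function name `greedy_grouped_cover` and its arguments are hypothetical, it assumes the groups partition the universe and that every set contains only universe elements, and it uses the standard greedy heuristic, which approximates rather than guarantees a smallest selection.

```python
import math


def greedy_grouped_cover(sets, groups, cover_frac):
    """Pick sets until every group reaches its desired partial cover.

    sets: dict mapping set id (e.g., a guide) -> set of elements it covers
    groups: dict mapping group id (e.g., a year) -> set of elements in it
    cover_frac: dict mapping group id -> desired fraction in [0, 1]
    All three names are hypothetical and used only for illustration.
    """
    # Number of elements that still must be covered in each group; the
    # small epsilon guards against float error pushing ceil() up by one.
    need = {g: math.ceil(cover_frac[g] * len(elems) - 1e-9)
            for g, elems in groups.items()}
    group_of = {e: g for g, elems in groups.items() for e in elems}
    covered = set()
    chosen = []

    def gain(set_id):
        # Count newly covered elements, capped per group at what that
        # group still needs, so over-covering a group earns no credit.
        per_group = {}
        for e in sets[set_id] - covered:
            per_group[group_of[e]] = per_group.get(group_of[e], 0) + 1
        return sum(min(count, need[g]) for g, count in per_group.items())

    while any(n > 0 for n in need.values()):
        best = max(sets, key=gain)
        if gain(best) == 0:
            raise ValueError("requested coverage cannot be reached")
        for e in sets[best] - covered:
            g = group_of[e]
            need[g] = max(0, need[g] - 1)
        covered |= sets[best]
        chosen.append(best)
    return chosen
```

Capping each group's contribution at its remaining need is what lets a group with a low desired cover stop influencing the selection once it is satisfied.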
Modifies the method to construct a guide (along with its memoizer) to account for the fact that there are sequences from separate groups that each have their own desired partial cover. (Taking a consensus across the largest cluster, as was previously done, may be sub-optimal in the case where there are a large number of sequences from a group with little desired coverage.)
Adds a --cover-by-year-decay argument to the design_guides.py executable. This argument reads in a year for each input sequence, as well as parameters governing the desired coverage across years. These parameters specify a desired coverage for each year that decays exponentially going back in time -- i.e., the partial cover of sequences from year N will be a constant factor (<1) of the partial cover of sequences from year N+1.
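The decay rule can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, and it assumes that the highest desired cover applies to a chosen reference year and to every later year. The actual parameters that --cover-by-year-decay accepts may differ.

```python
def cover_frac_by_year(years, highest_cover_frac, year_with_highest_cover, decay):
    """Map each year to a desired partial cover that decays going back in time.

    years: iterable of years, one per input sequence
    highest_cover_frac: desired cover for the reference year and later
    year_with_highest_cover: the reference year (assumed parameter)
    decay: constant factor in (0, 1) applied once per year going back
    """
    if not 0 < decay < 1:
        raise ValueError("decay must be strictly between 0 and 1")
    fracs = {}
    for year in sorted(set(years)):
        # Years at or after the reference year get the full desired cover.
        years_back = max(0, year_with_highest_cover - year)
        fracs[year] = highest_cover_frac * decay ** years_back
    return fracs


if __name__ == "__main__":
    print(cover_frac_by_year([2014, 2015, 2016, 2016], 0.9, 2016, 0.5))
```

With these assumptions, the cover for year N is always `decay` times the cover for year N+1, up to the reference year.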