priyappillai closed this pull request 2 years ago
Merging #63 (81f4b18) into main (6527d50) will increase coverage by 0.55%. The diff coverage is 91.91%.
```
@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   86.30%   86.85%   +0.55%
==========================================
  Files          50       52       +2
  Lines        8219     8649     +430
==========================================
+ Hits         7093     7512     +419
- Misses       1126     1137     +11
==========================================
```
Impacted Files | Coverage Δ | |
---|---|---|
bin/design.py | 65.48% <60.46%> (-0.34%) | :arrow_down: |
adapt/utils/seq_io.py | 35.77% <78.57%> (+2.93%) | :arrow_up: |
adapt/prepare/ncbi_neighbors.py | 55.96% <85.45%> (+12.62%) | :arrow_up: |
adapt/prepare/prepare_alignment.py | 67.91% <88.23%> (+4.27%) | :arrow_up: |
adapt/utils/weight.py | 93.75% <93.75%> (ø) | |
adapt/alignment.py | 95.86% <96.87%> (+0.09%) | :arrow_up: |
adapt/prepare/tests/test_ncbi_neighbors.py | 99.05% <98.57%> (-0.95%) | :arrow_down: |
adapt/guide_search.py | 84.71% <100.00%> (ø) | |
adapt/prepare/align.py | 66.15% <100.00%> (ø) | |
adapt/tests/test_alignment.py | 99.68% <100.00%> (+0.03%) | :arrow_up: |
... and 5 more | | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6527d50...81f4b18. Read the comment docs.
ADAPT currently bases how much coverage it has of a virus on the percent of genomic sequences in the database that are covered. This works well when the sequences are a random sample of the population, but sampling biases mean that is not always the case.
When ADAPT designs an assay across a taxon with multiple subtaxa, each sampled to a different depth, it can produce assays that cover only a highly overrepresented subtaxon and none of the other subtaxa. Having more sequences in the database can reflect a subtaxon's relative importance, but it should not cause other subtaxa to be treated as unimportant.
This PR introduces sequence weighting, both manual and automatic, to help account for this problem. Manual weights can be input with `--weight-sequences`, and weights can be determined automatically with `--weight-by-log-size-of-subtaxa`. Automatic weighting uses the log of the number of sequences within a subtaxon as a heuristic for how important that group is.

Changes by file:

- `adapt/utils/weight.py`
  - `normalize`
  - `percentile` (weighted)
  - `weight_by_log_group`
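As a rough illustration of what these helpers might do (the function names come from the list above, but the bodies below are hypothetical sketches, not the PR's implementation):

```python
import math

def normalize(weights, seq_names):
    """Scale weights so they sum to 1 across the given sequences."""
    total = sum(weights[name] for name in seq_names)
    return {name: weights[name] / total for name in seq_names}

def percentile(values, weights, q):
    """Weighted percentile: the smallest value v such that the
    cumulative weight of values <= v is at least q percent."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum / total >= q / 100.0:
            return v
    return pairs[-1][0]

def weight_by_log_group(groups):
    """Give each group a total weight of log(group size), shared
    equally among the group's member sequences.

    groups: dict mapping a subtaxon identifier to a list of
    sequence names.
    """
    weights = {}
    for group_seqs in groups.values():
        # log(1) == 0 would zero out singleton groups, so offset by 1
        per_seq = math.log(len(group_seqs) + 1) / len(group_seqs)
        for name in group_seqs:
            weights[name] = per_seq
    return weights
```

Under this heuristic, a subtaxon with 1,000 sequences carries only about log(1001)/log(3) ≈ 6.3 times the total weight of a subtaxon with 2 sequences, rather than 500 times.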
- `adapt/utils/tests/test_weight.py`
- `adapt/utils/seq_io.py`
  - `read_sequence_weights`
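A minimal reader for per-sequence weights might look like the following (the two-column TSV format here is an assumption; the PR's `read_sequence_weights` may expect a different format):

```python
def read_sequence_weights(fn):
    """Read a TSV of 'sequence_name<TAB>weight' lines into a dict,
    skipping blank lines and '#' comments."""
    weights = {}
    with open(fn) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            name, weight = line.split("\t")
            weights[name] = float(weight)
    return weights
```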
- `adapt/alignment.py`
  - `seq_idxs_weighted`: function to get the sum of the weights of a subset of sequences
  - `construct_guide`: change `num_needed` to `percent_needed`
  - `determine_consensus_sequence`/`position_entropy`/`base_percentages`: use weights to calculate base percentages rather than counts
- `adapt/tests/test_alignment.py`
  - `seq_idx_weighted`
  - `construct_guide`/`determine_consensus_sequence`/`position_entropy`/`base_percentages`/`determine_most_common_sequences`
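The shift from counts to weights in the base-percentage calculation can be illustrated like this (a hypothetical sketch; the real signatures in `adapt/alignment.py` differ):

```python
def base_percentages(column_bases, weights):
    """Fraction of total sequence weight carrying each base at one
    alignment position, replacing a raw count-based fraction.

    column_bases: list of bases, one per sequence, at this position.
    weights: list of per-sequence weights, parallel to column_bases.
    """
    total = sum(weights)
    frac = {}
    for base, w in zip(column_bases, weights):
        frac[base] = frac.get(base, 0.0) + w / total
    return frac
```

Unweighted, a column `['A', 'A', 'G']` gives A a 2/3 share; if the two A sequences come from an oversampled subtaxon and are down-weighted to 0.25 each while G keeps weight 1.0, A's share drops to 1/3.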
- `adapt/prepare/align.py`
  - `curate_against_ref`
- `adapt/prepare/prepare_alignment.py`
- `adapt/prepare/ncbi_neighbors.py`
  - `ncbi_detail_taxonomy_url`
  - `ncbi_search_taxonomy_url`
  - `parse_taxonomy_xml_for_taxid`
  - `parse_taxonomy_xml_for_rank`
  - `fetch_taxonomies`
  - `get_taxid`
  - `get_rank`
  - `get_subtaxa_groups`
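For the taxonomy helpers, pulling the rank out of an NCBI efetch taxonomy response could be sketched as follows (the XML snippet mirrors the general structure of NCBI's `TaxaSet` documents; the PR's `parse_taxonomy_xml_for_rank` may behave differently):

```python
import xml.etree.ElementTree as ET

def parse_taxonomy_xml_for_rank(xml_str):
    """Return the rank (e.g. 'species', 'genus') of the first Taxon
    element in an NCBI taxonomy XML document."""
    root = ET.fromstring(xml_str)
    taxon = root.find("Taxon")
    return taxon.findtext("Rank")

# Abbreviated example of an NCBI TaxaSet document
example_xml = """
<TaxaSet>
  <Taxon>
    <TaxId>64320</TaxId>
    <ScientificName>Zika virus</ScientificName>
    <Rank>species</Rank>
  </Taxon>
</TaxaSet>
"""
```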
- `adapt/prepare/tests/test_ncbi_neighbors.py`
- `adapt/guide_search.py`
  - `guide_set_activities_expected_value`/`guide_set_activities_expected_value_per_guide`/`guide_activities_expected_value`/`obj_value`/`_ground_set_with_activities_memoized`/`_analyze_guides`/`_analyze_guides_memoized`/`add_guide`: use weighted average
  - `_construct_guide_memoized`/`_find_optimal_guide`/`_find_guides_in_window`/`add_guide_to_cover`/`score_collection_of_guides`: use `percent_needed` rather than `num_needed`
  - `total_frac_bound_by_guides`: use weighted total (using `seq_idxs_weighted`)
- `adapt/tests/test_guide_search.py`
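The move from `num_needed` to `percent_needed` and the weighted total in `total_frac_bound_by_guides` can be illustrated together (a hypothetical sketch, not the PR's code):

```python
def total_frac_bound(bound_seqs, weights):
    """Weighted fraction of sequences bound by a guide set: the sum
    of the weights of the bound sequences over the total weight."""
    total = sum(weights.values())
    return sum(weights[s] for s in bound_seqs) / total

def meets_coverage(bound_seqs, weights, percent_needed):
    """With weights, requiring that a guide set cover N sequences
    (num_needed) no longer makes sense, since sequences contribute
    unequally; instead require a weighted fraction of coverage."""
    return total_frac_bound(bound_seqs, weights) >= percent_needed
```

Here a single down-weighted sequence counts for less toward the coverage target than a sequence from an undersampled subtaxon with a larger weight.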
- `bin/design.py`
  - `--weight-by-log-size-of-subtaxa` to be passed to `prepare_alignment.prepare_for`
- `bin/tests/test_design.py`