ADAPT currently measures its coverage of a virus as the percentage of genomic sequences in the database that are covered. This works well when the sequences are a random sample of the population, but sampling biases mean that is not always the case.

When ADAPT designs an assay across a taxon with multiple subtaxa, each sampled at a different level, it can create assays that cover only a highly overrepresented subtaxon and no others. Although having more sequences in the database does reflect a subtaxon's relative importance, it should not cause other subtaxa to be treated as unimportant.

This PR introduces sequence weighting, both automatic and manual, to help account for this problem. Manual weights can be supplied with --weight-sequences, and weights can be determined automatically with --weight-by-log-size-of-subtaxa. Automatic weighting uses the log of the number of sequences within a subtaxon as a heuristic for how important that group is.
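As a rough illustration of the log-size heuristic, the sketch below gives each subtaxon a total weight proportional to the log of its size and splits that weight evenly among its sequences. The function name `log_group_weights`, the input format, and the `+ 1` inside the log are assumptions made for this example; they are not taken from the PR's `weight_by_log_group`.

```python
import math
from collections import Counter


def log_group_weights(groups):
    """Hypothetical helper: weight sequences by the log of their group size.

    groups: dict mapping sequence name -> subtaxon label
    returns: dict mapping sequence name -> weight, normalized to sum to 1
    """
    sizes = Counter(groups.values())
    # Each group's total weight is log(size + 1), shared equally by its
    # members. The + 1 is an assumption so that a group with a single
    # sequence still receives a nonzero weight.
    raw = {
        name: math.log(sizes[group] + 1) / sizes[group]
        for name, group in groups.items()
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# 99 sequences from subtaxon A and 1 from subtaxon B
groups = {f"a{i}": "A" for i in range(99)}
groups["b0"] = "B"
weights = log_group_weights(groups)
share_a = sum(w for name, w in weights.items() if groups[name] == "A")
```

Under these assumptions, subtaxon A's share of the total weight falls from 0.99 (by raw counts) to roughly 0.87, so subtaxon B still matters to the design without being treated as equally important.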

adapt/utils/weight.py

Create file
Add functions:
- normalize
- percentile (weighted)
- weight_by_log_group
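A weighted percentile can be sketched as follows: sort the values, accumulate their weights, and return the first value at which the cumulative weight reaches the requested share. This is a minimal version without interpolation; the name `weighted_percentile` and its signature are assumptions for illustration, and the PR's own `percentile` may behave differently at the boundaries.

```python
def weighted_percentile(values, q, weights):
    """Hypothetical helper: q-th percentile (0-100) of values, where each
    value counts in proportion to its weight rather than once."""
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    threshold = q / 100 * total
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        # Return the first value whose cumulative weight reaches the threshold
        if cumulative >= threshold:
            return value
    # Guard against floating-point rounding when q is 100
    return pairs[-1][0]


# With equal weights this is an ordinary (non-interpolated) percentile
median_equal = weighted_percentile([10, 20, 30, 40], 50, [1, 1, 1, 1])
# A heavily weighted low value pulls the median down
median_skewed = weighted_percentile([1, 2, 3], 50, [8, 1, 1])
```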

adapt/utils/tests/test_weight.py

Create file

adapt/utils/seq_io.py

Add read_sequence_weights

adapt/alignment.py

Add normalized weights as a property of Alignment objects
Add seq_idxs_weighted function to get the sum of the weights of a subset of sequences
In construct_guide, change num_needed to percent_needed
In determine_consensus_sequence/position_entropy/base_percentages, use weights to calculate base percentages rather than counts
In determine_most_common_sequences, use weights for sequence percentages rather than counts
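The switch from counts to weights can be illustrated on a single alignment column. The sketch below is standalone and does not use the `Alignment` class; the helper names are hypothetical and weights are passed as a plain list in sequence order.

```python
from collections import defaultdict


def weighted_base_fractions(column, weights):
    """Hypothetical helper: fraction of the total weight carried by each
    base in one alignment column (one base and one weight per sequence)."""
    totals = defaultdict(float)
    for base, weight in zip(column, weights):
        totals[base] += weight
    weight_sum = sum(totals.values())
    return {base: total / weight_sum for base, total in totals.items()}


def weighted_consensus_base(column, weights):
    """Base with the largest weighted fraction in the column."""
    fractions = weighted_base_fractions(column, weights)
    return max(fractions, key=fractions.get)


column = "AAAG"
# By count, 'A' wins 3 to 1
by_count = weighted_consensus_base(column, [1, 1, 1, 1])
# If the last sequence carries most of the weight, 'G' wins instead
by_weight = weighted_consensus_base(column, [1, 1, 1, 7])
```

With equal weights the result matches the count-based consensus, so unweighted inputs behave as before.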

adapt/tests/test_alignment.py

Add tests for seq_idxs_weighted
Add tests for weighted construct_guide/determine_consensus_sequence/position_entropy/base_percentages/determine_most_common_sequences

adapt/prepare/align.py

Move reference accession check to before the loop in curate_against_ref

adapt/prepare/prepare_alignment.py

Change prepare alignments to use 0 for the 'tax ID' and the unaligned FASTA file name for the 'segment'
Output weights

adapt/prepare/ncbi_neighbors.py

Add functions:
- ncbi_detail_taxonomy_url
- ncbi_search_taxonomy_url
- parse_taxonomy_xml_for_taxid
- parse_taxonomy_xml_for_rank
- fetch_taxonomies
- get_taxid
- get_rank
- get_subtaxa_groups

adapt/prepare/tests/test_ncbi_neighbors.py

Test NCBI taxonomy URL generation
Test fetching taxonomies, getting rank, and getting subtaxa groups

adapt/guide_search.py

In guide_set_activities_percentile, use weighted percentile
In guide_set_activities_expected_value/guide_set_activities_expected_value_per_guide/guide_activities_expected_value/obj_value/_ground_set_with_activities_memoized/_analyze_guides/_analyze_guides_memoized/add_guide, use weighted average
In _construct_guide_memoized/_find_optimal_guide/_find_guides_in_window/add_guide_to_cover/score_collection_of_guides, use percent_needed rather than num_needed
In total_frac_bound_by_guides, use weighted total (using seq_idxs_weighted)
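The two weighted quantities used here, an expected activity and a fraction of sequences bound, can be sketched in a few lines. The function names and the list-based inputs are assumptions for illustration and are not the signatures used in guide_search.py.

```python
def weighted_mean_activity(activities, weights):
    """Hypothetical helper: expected activity across sequences, where each
    sequence contributes in proportion to its weight."""
    total = sum(weights)
    return sum(a * w for a, w in zip(activities, weights)) / total


def weighted_frac_bound(bound_idxs, weights):
    """Hypothetical helper: share of the total weight held by the
    sequences (given by index) that are bound by at least one guide."""
    return sum(weights[i] for i in bound_idxs) / sum(weights)


# Two sequences: the guide is fully active against the first only
activities = [1.0, 0.0]
weights = [1, 3]
expected = weighted_mean_activity(activities, weights)
frac = weighted_frac_bound({0}, weights)
```

Here the unweighted mean would be 0.5, while the weighted mean and the weighted fraction bound are both 0.25 because the unbound sequence carries three times the weight. A threshold such as percent_needed would then be compared against a weighted fraction of this kind rather than against a sequence count.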

adapt/tests/test_guide_search.py

Modify tests to use percent needed rather than num needed
Use almost-equal assertions for percent needed to avoid floating-point issues
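The reason for the almost-equal comparison is that sums of fractional weights are rarely exact in floating point. A minimal sketch with `unittest` (the test class and the values are made up for this example):

```python
import unittest


class TestPercentNeeded(unittest.TestCase):
    def test_sum_of_weights(self):
        # Two sequence weights that should add up to 0.3
        covered = 0.1 + 0.2
        # Exact comparison fails because of binary rounding
        self.assertNotEqual(covered, 0.3)
        # assertAlmostEqual compares to 7 decimal places by default
        self.assertAlmostEqual(covered, 0.3)
```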

bin/design.py

Change prepare alignments to use 0 for the 'tax ID' and the unaligned FASTA file name for the 'segment'
Set up preparation steps to also include outputting weights
Add argument --weight-by-log-size-of-subtaxa to be passed to prepare_alignment.prepare_for

bin/tests/test_design.py

Add weighted argument type
Test weighted FASTA input
Test auto weighted sequences

Codecov Report

Merging #63 (81f4b18) into main (6527d50) will increase coverage by 0.55%. The diff coverage is 91.91%.

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   86.30%   86.85%   +0.55%     
==========================================
  Files          50       52       +2     
  Lines        8219     8649     +430     
==========================================
+ Hits         7093     7512     +419     
- Misses       1126     1137      +11

Impacted Files	Coverage Δ
bin/design.py	65.48% <60.46%> (-0.34%)
adapt/utils/seq_io.py	35.77% <78.57%> (+2.93%)
adapt/prepare/ncbi_neighbors.py	55.96% <85.45%> (+12.62%)
adapt/prepare/prepare_alignment.py	67.91% <88.23%> (+4.27%)
adapt/utils/weight.py	93.75% <93.75%> (ø)
adapt/alignment.py	95.86% <96.87%> (+0.09%)
adapt/prepare/tests/test_ncbi_neighbors.py	99.05% <98.57%> (-0.95%)
adapt/guide_search.py	84.71% <100.00%> (ø)
adapt/prepare/align.py	66.15% <100.00%> (ø)
adapt/tests/test_alignment.py	99.68% <100.00%> (+0.03%)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

broadinstitute / adapt

Add sequence weighting #63
