broadinstitute / adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.
MIT License

Add sequence weighting #63

Closed priyappillai closed 2 years ago

priyappillai commented 2 years ago

ADAPT currently measures its coverage of a virus as the percentage of genomic sequences in the database that are covered. While this works well if the sequences represent a random sample of the population, this is not always the case due to sampling biases.

For the case when ADAPT is designing an assay across a taxon with multiple subtaxa, each with different levels of sampling, ADAPT can create assays that only cover a highly overrepresented subtaxon and no other subtaxa. While having more sequences in the database is representative of a subtaxon's relative importance, it should not cause other subtaxa to be treated as unimportant.

This PR introduces sequence weighting, both automatic and manual, to help account for this problem. Manual weights can be supplied with --weight-sequences, and weights can be determined automatically with --weight-by-log-size-of-subtaxa. Automatic weighting uses the log of the number of sequences within a subtaxon as a heuristic for how important that group is.
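To illustrate the idea behind the automatic weighting, here is a minimal sketch. It is not the code in adapt/utils/weight.py. The function names `log_size_weights` and `weighted_coverage` are hypothetical. Using log(n + 1) rather than log(n), splitting each subtaxon's weight evenly among its sequences, and normalizing the weights to sum to 1 are all assumptions made for this example.

```python
import math
from collections import Counter


def log_size_weights(subtaxon_of):
    """Map each sequence name to a weight, normalized to sum to 1.

    subtaxon_of: dict mapping sequence name -> subtaxon label.
    Assumption: a subtaxon with n sequences gets total weight proportional
    to log(n + 1), split evenly among its sequences. The +1 keeps a
    single-sequence subtaxon from receiving zero weight.
    """
    sizes = Counter(subtaxon_of.values())
    raw = {
        name: math.log(sizes[taxon] + 1) / sizes[taxon]
        for name, taxon in subtaxon_of.items()
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def weighted_coverage(covered, weights):
    """Fraction of the total weight held by the covered sequences."""
    return sum(weights[name] for name in covered) / sum(weights.values())


# Toy database: subtaxon A has 99 sequences, subtaxon B has 1.
subtaxon_of = {f"A{i}": "A" for i in range(99)}
subtaxon_of["B0"] = "B"
weights = log_size_weights(subtaxon_of)

# A hypothetical assay that covers every A sequence and misses B entirely.
covered = {name for name, taxon in subtaxon_of.items() if taxon == "A"}
unweighted = len(covered) / len(subtaxon_of)
weighted = weighted_coverage(covered, weights)
print(f"unweighted coverage: {unweighted:.2f}")
print(f"weighted coverage:   {weighted:.2f}")
```

In this toy case the assay covers 99% of the sequences but only about 87% of the weight, so a coverage threshold above that value would no longer be met by covering subtaxon A alone.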

adapt/utils/weight.py

adapt/utils/tests/test_weight.py

adapt/utils/seq_io.py

adapt/alignment.py

adapt/tests/test_alignment.py

adapt/prepare/align.py

adapt/prepare/prepare_alignment.py

adapt/prepare/ncbi_neighbors.py

adapt/prepare/tests/test_ncbi_neighbors.py

adapt/guide_search.py

adapt/tests/test_guide_search.py

bin/design.py

bin/tests/test_design.py

codecov[bot] commented 2 years ago

Codecov Report

Merging #63 (81f4b18) into main (6527d50) will increase coverage by 0.55%. The diff coverage is 91.91%.

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   86.30%   86.85%   +0.55%     
==========================================
  Files          50       52       +2     
  Lines        8219     8649     +430     
==========================================
+ Hits         7093     7512     +419     
- Misses       1126     1137      +11     
| Impacted Files | Coverage Δ | |
|---|---|---|
| bin/design.py | 65.48% <60.46%> (-0.34%) | :arrow_down: |
| adapt/utils/seq_io.py | 35.77% <78.57%> (+2.93%) | :arrow_up: |
| adapt/prepare/ncbi_neighbors.py | 55.96% <85.45%> (+12.62%) | :arrow_up: |
| adapt/prepare/prepare_alignment.py | 67.91% <88.23%> (+4.27%) | :arrow_up: |
| adapt/utils/weight.py | 93.75% <93.75%> (ø) | |
| adapt/alignment.py | 95.86% <96.87%> (+0.09%) | :arrow_up: |
| adapt/prepare/tests/test_ncbi_neighbors.py | 99.05% <98.57%> (-0.95%) | :arrow_down: |
| adapt/guide_search.py | 84.71% <100.00%> (ø) | |
| adapt/prepare/align.py | 66.15% <100.00%> (ø) | |
| adapt/tests/test_alignment.py | 99.68% <100.00%> (+0.03%) | :arrow_up: |
| ... and 5 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6527d50...81f4b18.