broadinstitute / adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.
MIT License

Check for reverse complement sequences when aligning #62

Closed: priyappillai closed this 2 years ago

priyappillai commented 2 years ago

Occasionally, sequences in NCBI are uploaded as the reverse complement of the reference sequence. To account for this, MAFFT is now run with --adjustdirection, which checks the reverse complement of every sequence against the first sequence in the input. This changes how reference sequences are used during curation: the reverse complement of the sequence being checked is now stored at that point.
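For illustration, here is a minimal sketch of the two pieces involved: computing a reverse complement and building a MAFFT call with `--adjustdirection`. The function names (`reverse_complement`, `build_mafft_cmd`, `run_mafft`) are hypothetical and are not taken from `adapt/prepare/align.py`; the only MAFFT option assumed is `--adjustdirection` itself.

```python
import subprocess

# Hypothetical helpers for illustration; not ADAPT's actual implementation.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Characters other than A/C/G/T (e.g. N or gaps) are left unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]


def build_mafft_cmd(in_fasta, adjust_direction=True):
    """Build the argument list for a MAFFT run on in_fasta."""
    cmd = ["mafft"]
    if adjust_direction:
        # Ask MAFFT to try the reverse complement of each sequence
        # and keep whichever direction matches the first sequence.
        cmd.append("--adjustdirection")
    cmd.append(in_fasta)
    return cmd


def run_mafft(in_fasta, out_fasta):
    """Run MAFFT; it writes the alignment to stdout, so redirect to a file."""
    with open(out_fasta, "w") as out:
        subprocess.run(build_mafft_cmd(in_fasta), stdout=out, check=True)
```

Because the direction is judged relative to the first input sequence, the order of the input FASTA matters. MAFFT marks the sequences it reversed by prefixing their names with `_R_`, so downstream code that looks sequences up by name may need to account for that.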

codecov[bot] commented 2 years ago

Codecov Report

Merging #62 (b3623e0) into main (f045cdf) will decrease coverage by 0.06%. The diff coverage is 67.56%.

@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
- Coverage   86.81%   86.74%   -0.07%     
==========================================
  Files          52       52              
  Lines        8685     8708      +23     
==========================================
+ Hits         7540     7554      +14     
- Misses       1145     1154       +9     
Impacted Files                         Coverage Δ
adapt/prepare/align.py                 64.77% <16.66%> (-1.39%) :arrow_down:
adapt/prepare/prepare_alignment.py     68.93% <90.90%> (+1.01%) :arrow_up:
adapt/prepare/tests/test_align.py      90.47% <100.00%> (ø)
bin/tests/test_design.py               99.62% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f045cdf...b3623e0.

priyappillai commented 2 years ago

I needed to change how we make sure the sequences match the reference direction. I was storing the correctly oriented sequence during curation, but I realized that would not work with alignment stats memoization, since that information isn't stored there. Rather than change alignment stats memoization, I went back to the strategy we discussed previously: making sure the reference sequence is the first sequence given to MAFFT.
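As a rough sketch of the "reference first" strategy (not the code in this PR), the input to MAFFT can be reordered so the reference accession leads the FASTA. The helper names `reference_first` and `to_fasta`, and the use of a plain dict keyed by accession, are assumptions made for this example.

```python
# Hypothetical helpers for illustration; not ADAPT's actual implementation.

def reference_first(seqs, ref_acc):
    """Return (name, sequence) pairs with the reference accession first.

    seqs is a dict mapping accession -> sequence; the relative order of
    the non-reference sequences is preserved.
    """
    if ref_acc not in seqs:
        raise KeyError(f"reference accession {ref_acc!r} not in sequences")
    ordered = [(ref_acc, seqs[ref_acc])]
    ordered.extend((name, seq) for name, seq in seqs.items()
                   if name != ref_acc)
    return ordered


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Example: the reference "REF" is moved ahead of the other accessions.
seqs = {"A1": "ACGT", "REF": "AACG", "B2": "CGTT"}
ordered = reference_first(seqs, "REF")
fasta_text = to_fasta(ordered)
```

With the reference in the first position, `--adjustdirection` orients every other sequence relative to it, so nothing about direction has to be stored during curation.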

Also, regarding:

  1. Is it the case that we never actually used remove_ref_accs except in test cases?

remove_ref_accs was used before the annotations PR (#53), but that PR required reference accessions to be kept until after alignment and annotations were added. I probably should have removed it then.