andersen-lab / bjorn

GNU General Public License v3.0

Issue22 fix #24

Closed AlaaALatif closed 3 years ago

AlaaALatif commented 3 years ago

This PR attempts to close issues #21 and #22, which concern the identification of point mutations (substitutions) that result from out-of-frame deletions.

Testing

  1. Six sequences were randomly sampled from GISAID for each of the following lineages: B.1.617.2, C.37, and B.1.1.7, for a total of 18 sequences (plus the reference sequence NC_045512).
  2. An alignment of all samples was generated using minimap2 and datafunk.
  3. Mutations for each sample were computed using msa_2_mutations.py before and after the code changes.
  4. The two sets of outputs were compared to confirm the expected behavior after the code changes.
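
The comparison in step 4 can be sketched as a per-sample set difference. This is only an illustration and not the script used for this PR: the function name `compare_mutation_calls`, the dict-of-sets input shape, and the sample data are all hypothetical.

```python
def compare_mutation_calls(before, after):
    """Compare per-sample mutation sets from two runs.

    Both arguments map a sample name to an iterable of mutation names.
    Returns {sample: (lost, gained)} for samples whose calls differ.
    """
    diffs = {}
    for sample in sorted(set(before) | set(after)):
        old = set(before.get(sample, ()))
        new = set(after.get(sample, ()))
        lost, gained = old - new, new - old
        if lost or gained:
            diffs[sample] = (sorted(lost), sorted(gained))
    return diffs


# Made-up input for illustration; not the data used in this PR.
before = {"sample_1": {"S:D614G"}, "sample_2": {"S:D614G"}}
after = {"sample_1": {"S:D614G", "S:E156G"}, "sample_2": {"S:D614G"}}

print(compare_mutation_calls(before, after))
# {'sample_1': ([], ['S:E156G'])}
```

An empty `lost` list for every sample, as in this toy run, corresponds to the case where the new code only adds mutation calls and removes none.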

The outputs were identical except for the following mutations, which were identified only after the code changes: 'ORF1a:S3675K', 'S:E156G', 'S:I68I', 'S:R246N', 'ORF8:D119I'.

The out-of-frame deletions identified in the same samples were: 'ORF1a:DEL3675.3/3678.3', 'S:DEL68.7/70.7', 'S:DEL156.7/158.7', 'S:DEL246.7/253.7', 'ORF8:DEL119.3/121.3'.

Each new substitution falls at the codon where one of these deletions starts. This indicates that all observed differences are additional substitutions resulting from out-of-frame (non-frameshifting) deletions, i.e. deletions that begin partway through a codon.

Implementation

To implement these changes, a new, corrected numbering system had to be used for assigning deletions. In essence, the starting coordinate of any deletion has the value x.y, where x is the number of the codon in which the deletion starts and y describes the position within that codon: y=0 if the deletion starts at the beginning of the codon, y=3 if it starts in the middle, and y=7 if it starts at the end. The end coordinate of the deletion is assigned in the same way. These corrected coordinates are stored in a new column named deletion_name, while the original naming convention is kept under mutation so that users can still easily search for it in outbreak.info.
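
The x.y scheme can be illustrated with a short sketch. It is not the code from this PR: the helper names, the 0-based nucleotide offsets, and the 1-based codon numbering are assumptions made for the example, and the PR does not state whether the end offset is inclusive or exclusive, so the sketch simply encodes whatever offsets it is given.

```python
# Assumed mapping: position within the codon (0, 1, 2) -> y digit.
CODON_POSITION_DIGIT = {0: 0, 1: 3, 2: 7}


def deletion_coord(nt_offset):
    """Format a 0-based nucleotide offset within a gene as 'x.y'."""
    codon = nt_offset // 3 + 1  # 1-based codon number (assumption)
    return f"{codon}.{CODON_POSITION_DIGIT[nt_offset % 3]}"


def deletion_name(gene, start_offset, end_offset):
    """Build a name in the style of the deletion_name column."""
    start = deletion_coord(start_offset)
    end = deletion_coord(end_offset)
    return f"{gene}:DEL{start}/{end}"


# Offsets chosen so the result matches a name reported above.
print(deletion_name("S", 467, 473))
# S:DEL156.7/158.7
```

Under these assumptions, an offset on a codon boundafrican such as 0 or 3 would yield y=0, which is how an in-frame deletion would be distinguished from the out-of-frame ones listed in the testing section.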