This PR attempts to close issues #21 and #22, which concern the identification of point mutations (substitutions) that result from out-of-frame deletions.
Testing
Six sequences were randomly sampled from GISAID for each of the following lineages: B.1.617.2, C.37, and B.1.1.7, for a total of 18 sequences (plus the reference sequence NC045512).
An alignment of all samples was generated using minimap2 and datafunk.
Mutations for each sample were computed using msa_2_mutations.py before and after the code changes.
The outputs were compared to confirm the expected behavior after the code changes.
As a result, the outputs were found to be identical except for the following mutations, which were identified only after the code changes:
'ORF1a:S3675K', 'S:E156G', 'S:I68I', 'S:R246N', 'ORF8:D119I'
The out-of-frame deletions identified in the same samples were:
'ORF1a:DEL3675.3/3678.3', 'S:DEL68.7/70.7', 'S:DEL156.7/158.7', 'S:DEL246.7/253.7', 'ORF8:DEL119.3/121.3'
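Each new substitution falls on the first codon of one of these deletions (for example, S:E156G and S:DEL156.7/158.7). This indicates that all observed differences are additional substitutions resulting from out-of-frame (non-frameshifting) deletions.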
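The check described above can be sketched roughly as follows. This is an illustration only, not code from msa_2_mutations.py: the function names and regular expressions are hypothetical, and the sketch assumes that mutation names follow the two shapes quoted in this description (GENE:REFposALT and GENE:DELx.y/x.y).

```python
import re

# Hypothetical patterns for the two name shapes quoted in this PR description.
DELETION_RE = re.compile(
    r"^(?P<gene>[^:]+):DEL(?P<start>\d+)\.(?P<start_y>\d)/(?P<end>\d+)\.(?P<end_y>\d)$"
)
SUBSTITUTION_RE = re.compile(r"^(?P<gene>[^:]+):[A-Z*](?P<codon>\d+)[A-Z*]$")


def only_after(before, after):
    """Return mutation names present in `after` but not in `before`."""
    return sorted(set(after) - set(before))


def starts_mid_codon(deletion):
    """True if the deletion's start suffix is not 0, i.e. it starts inside a codon."""
    match = DELETION_RE.match(deletion)
    return match is not None and match.group("start_y") != "0"


def explaining_deletion(substitution, deletions):
    """Return a deletion in the same gene whose start codon equals the substitution's codon."""
    sub = SUBSTITUTION_RE.match(substitution)
    if sub is None:
        return None
    for deletion in deletions:
        d = DELETION_RE.match(deletion)
        if (
            d is not None
            and starts_mid_codon(deletion)
            and d.group("gene") == sub.group("gene")
            and d.group("start") == sub.group("codon")
        ):
            return deletion
    return None


# Names copied from the test results above.
new_substitutions = ["ORF1a:S3675K", "S:E156G", "S:I68I", "S:R246N", "ORF8:D119I"]
deletions = [
    "ORF1a:DEL3675.3/3678.3",
    "S:DEL68.7/70.7",
    "S:DEL156.7/158.7",
    "S:DEL246.7/253.7",
    "ORF8:DEL119.3/121.3",
]

pairs = {s: explaining_deletion(s, deletions) for s in new_substitutions}
for substitution, deletion in pairs.items():
    print(substitution, "<-", deletion)
```

A substitution that maps to None would be a difference not accounted for by an out-of-frame deletion and would need a closer look.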
Implementation
To implement these changes, a new, corrected numbering system is used for naming deletions. The start coordinate of a deletion has the form x.y, where x is the number of the codon in which the deletion starts and y encodes the position within that codon: y=0 if the deletion starts at the beginning of the codon, y=3 if it starts in the middle, and y=7 if it starts at the end. The end coordinate of the deletion is assigned in the same way. These corrected coordinates are stored in a new column named deletion_name, while the original naming convention is kept under mutation so that users can still search for it easily in outbreak.info.
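As a rough illustration of the x.y convention, the sketch below derives a coordinate from a 1-based nucleotide position within a gene. The function names are hypothetical and not taken from msa_2_mutations.py. The sketch also assumes that both ends are derived directly from the first and last deleted nucleotide, so how the PR anchors the end coordinate may differ.

```python
# Suffix for each position within a codon: first -> 0, middle -> 3, last -> 7.
POSITION_SUFFIX = {0: 0, 1: 3, 2: 7}


def codon_coordinate(gene_nt_pos):
    """Map a 1-based nucleotide position within a gene to an 'x.y' string."""
    if gene_nt_pos < 1:
        raise ValueError("gene_nt_pos must be 1-based")
    codon = (gene_nt_pos - 1) // 3 + 1      # x: 1-based codon number
    offset = (gene_nt_pos - 1) % 3          # 0, 1 or 2 within the codon
    return f"{codon}.{POSITION_SUFFIX[offset]}"


def deletion_name(gene, first_deleted_nt, last_deleted_nt):
    """Build a name such as 'GENE:DELx.y/x.y' (assumed format, see lead-in)."""
    start = codon_coordinate(first_deleted_nt)
    end = codon_coordinate(last_deleted_nt)
    return f"{gene}:DEL{start}/{end}"


# Toy gene "X": a deletion of nucleotides 4-9 removes codons 2 and 3 exactly,
# while a deletion of nucleotides 5-10 starts in the middle of codon 2.
print(deletion_name("X", 4, 9))
print(deletion_name("X", 5, 10))
```

A non-zero start suffix marks a deletion that begins inside a codon, which is the case in which an extra substitution at that codon can arise.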