eringill / chronic_infection_python

A simple GUI that allows users to check whether mutations from a SARS-CoV-2 genome best fit a mutational distribution of genomes derived from global, chronic, or deer infections.
https://eringill.shinyapps.io/covid_mutation_distributions/
MIT License

chronic_infection_python

Overview

This application was developed by the Computational Analysis, Modelling and Evolutionary Outcomes (CAMEO) pillar of Canada's Coronavirus Variants Rapid Response Network (CoVaRR-Net). Data analysis, code and maintenance of the application are conducted by Erin E. Gill, Fiona S.L. Brinkman, and Sarah Otto.

Given a user-provided set of SARS-CoV-2 nucleotide mutations, this application compares the probability of generating this set from each of four reference distributions: chronic infection, deer-specific mutations, global (pre-VOC) and global (Omicron era).

Background

SARS-CoV-2 evolution exhibits a strong clock-like signature with mutational changes accumulating over time, but this pattern is punctuated by “saltational changes”, where lineages appear with a higher number of mutations than expected from their divergence time from other lineages (Neher (2022)). Such unusual lineages are thought to reflect long passage times within immunocompromised individuals, sharing many of the same signatures seen in chronic infections (Harari et al. (2022)).

When unusual lineages arise, however, it is challenging to know the evolutionary history leading to the observed genomic changes. Other processes, including passage through animals (Bashor et al. (2021), Naderi et al. (2023)), mutator lineages with error-prone polymerases (Takada et al. (2023)), and exposure to mutagens such as molnupiravir (Gruber et al. (2024)), can also leave unusual genomic signatures.

Given a user-provided set of nucleotide mutations or a genome consensus sequence defining an unusual lineage of SARS-CoV-2, this application compares the probability of generating this set from the following four distributions: chronic infection, deer-specific mutations, global (pre-VOC) and global (Omicron era).

In the first paper, the authors demonstrate that specific lineage-defining mutation patterns occur in SARS-CoV-2 genomes that are sequenced from chronic infections vs. mutations that occurred in SARS-CoV-2 genomes sequenced around the globe at the start of the pandemic (before the rise of Variants of Concern (VOCs)). They also analyzed lineage-defining mutation patterns in VOCs, and concluded that “mutations in chronic infections are predictive of lineage-defining mutations of VOCs”.

Feng et al. sequenced hundreds of SARS-CoV-2 samples obtained from white-tailed deer in the United States. They observed Alpha, Gamma, Delta and Omicron VOCs and determined that the deer infections arose from a minimum of 109 separate transmission events from humans. In addition, the deer were then able to transmit the virus to each other. Deer infections resulted in three documented human zoonoses. The SARS-CoV-2 virus displayed specific adaptation patterns in deer, which differ from adaptations seen in humans.

In addition, the app informs the user whether the data contain signals consistent with:

Table 1: Mutator Sites. Known and potential mutator sites (denoted by “Confirmed” and “Potential” in the “Site Type” column, respectively) are listed in the table below. Known sites have been confirmed experimentally, and the specific amino acid / nucleotide changes leading to mutator phenotypes are shown. Potential sites lie within the ExoN proofreading domain of nsp14 (as shown in Mack et al. 2023). The wild type amino acids, their positions within the mature nsp14 protein, encoding nucleotides and genomic locations are shown for these sites, but changes that would lead to mutator phenotypes have not been confirmed.

| Gene | Amino Acid Change | Nucleotide Change | Site Type | Reference |
| --- | --- | --- | --- | --- |
| nsp14 | C39F | G18,155T | Confirmed | (Mack et al. 2023) |
| nsp14 | F60S | T18,218C | Confirmed | (Takada et al. 2023) |
| nsp14 | P203L | C18,647T | Confirmed | (Mack et al. 2023) |
| nsp14 | D90 | 18,307-18,309 (GAT) | Potential | (Mack et al. 2023) |
| nsp14 | E92 | 18,313-18,315 (GAG) | Potential | (Mack et al. 2023) |
| nsp14 | E191 | 18,610-18,612 (GAG) | Potential | (Mack et al. 2023) |
| nsp14 | H268 | 18,841-18,843 (CAT) | Potential | (Mack et al. 2023) |
| nsp14 | D273 | 18,856-18,858 (GAT) | Potential | (Mack et al. 2023) |
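
To illustrate how the sites in Table 1 could be used, here is a minimal Python sketch that flags mutations matching a confirmed mutator change or falling within a potential site. The coordinates are copied from Table 1; the function name `check_mutator_sites` and the data layout are assumptions made for this example and are not taken from the application's code.

```python
import re

# Confirmed mutator changes from Table 1 (nucleotide change -> amino acid change).
CONFIRMED = {"G18155T": "C39F", "T18218C": "F60S", "C18647T": "P203L"}

# Potential sites from Table 1 (amino acid -> inclusive genomic codon range).
POTENTIAL = {
    "D90": (18307, 18309),
    "E92": (18313, 18315),
    "E191": (18610, 18612),
    "H268": (18841, 18843),
    "D273": (18856, 18858),
}


def check_mutator_sites(mutations):
    """Return (confirmed_hits, potential_hits) for mutations such as 'C18647T'.

    Hypothetical helper: the real application may apply different rules.
    """
    confirmed, potential = [], []
    for raw in mutations:
        mut = raw.strip().upper()
        match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", mut)
        if match is None:
            raise ValueError(f"unrecognised mutation: {raw!r}")
        pos = int(match.group(2))
        if mut in CONFIRMED:
            confirmed.append((mut, CONFIRMED[mut]))
        for site, (start, end) in POTENTIAL.items():
            if start <= pos <= end:
                potential.append((mut, site))
    return confirmed, potential


confirmed_hits, potential_hits = check_mutator_sites(["C241T", "C18647T", "A18308G"])
print(confirmed_hits)
print(potential_hits)
```

In this sketch a mutation at a potential site is only reported, not interpreted, since Table 1 notes that changes leading to mutator phenotypes at those sites have not been confirmed.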

Application Use

This application accepts a comma-separated list of nucleotide positions in a SARS-CoV-2 genome where lineage-defining mutations occur. Lineage-defining mutations are the subset of mutations in a lineage that have occurred since divergence from the larger SARS-CoV-2 tree. A list of lineage-defining mutations (the “mutation set”) for pangolin-designated SARS-CoV-2 lineages can be found here. The tool will also accept a FASTA file containing a SINGLE SARS-CoV-2 genome consensus sequence. In this case, the Nextclade CLI is used to determine lineage-defining mutations (called private mutations in Nextclade).

The application determines the likelihood of observing the mutation set as a random draw from each distribution (chronic infection, deer-specific mutations, global (pre-VOC) and global (Omicron era)). The log likelihood of observing the mutation set from each distribution is displayed (in natural log units).
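
Because the values are natural-log likelihoods, the difference between two of them can be exponentiated to give a likelihood ratio ("how many times more likely"). The sketch below shows the arithmetic; the function name `likelihood_ratio` is an assumption for illustration, and the two input values are simply example log-likelihoods of the kind the application reports.

```python
import math


def likelihood_ratio(loglik_a, loglik_b):
    """How many times more likely the data are under A than under B.

    Both arguments are natural-log likelihoods.
    """
    return math.exp(loglik_a - loglik_b)


# Example values only: two log-likelihoods that differ by 0.46 natural-log units.
ratio = likelihood_ratio(-11.04, -11.50)
print(f"{ratio:.2f} times more likely")
```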

Because the mutational data sets are sparse, the method bins sites across the genome when calculating likelihoods. The user can define the bin of interest: genes, genes splitting the spike protein into regions of interest, genome split into 500 nucleotide windows, or genome split into 1000 nucleotide windows. For a given bin choice, the log-likelihood of drawing the user-defined mutation set from each distribution is calculated from the multinomial distribution as:

sum(log(((distribution bin counts + 1) / sum(distribution bin counts + 1))^user bin counts))

The addition of one to each bin (a pseudocount) ensures that there are no bins lacking data, so no bin has zero probability and the log-likelihood is always finite.
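
The calculation above can be sketched in a few lines of Python. This is an illustration of the formula under stated assumptions, not the application's own code: the function names, the 29,903-nucleotide genome length (the Wuhan-Hu-1 reference) and the toy reference positions are all assumptions introduced for the example. Note that the multinomial coefficient is omitted, as in the formula; it depends only on the user's counts and so does not change the comparison between distributions.

```python
import math

GENOME_LENGTH = 29903  # assumed reference length (Wuhan-Hu-1)


def bin_counts(positions, window=500, genome_length=GENOME_LENGTH):
    """Count mutations per fixed-width window; bin 0 covers positions 1..window."""
    n_bins = math.ceil(genome_length / window)
    counts = [0] * n_bins
    for pos in positions:
        counts[(pos - 1) // window] += 1
    return counts


def log_likelihood(dist_counts, user_counts):
    """sum(user_count * log((dist_count + 1) / sum(dist_counts + 1))) over bins."""
    total = sum(count + 1 for count in dist_counts)
    return sum(
        user * math.log((dist + 1) / total)
        for dist, user in zip(dist_counts, user_counts)
    )


# Toy reference positions, made up for illustration (not real distribution data).
toy_reference = [241, 300, 3037, 23403, 23403, 28881]
user_positions = [241, 3037, 23403, 28881, 28882, 28883]

dist = bin_counts(toy_reference)
user = bin_counts(user_positions)
loglik = log_likelihood(dist, user)
print(f"log-likelihood: {loglik:.3f}")
```

Writing `count * log(p)` rather than `log(p ** count)` is mathematically equivalent to the formula above and avoids underflow when counts are large.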

CLI

A command line interface (CLI) is available for this application. The CLI is a Python script. You can install the necessary packages with conda using the following command:

conda env create -f environment.yaml

Here is an example of how to run the CLI with a list of mutations and the output you can expect:

$ python covid_mutation_distribution/cli.py "C241T, C3037T, A23403G, G28881A, G28882A, G28883C"
Number of mutations: 6
Transition/Transversion ratio: 5.00

Log Likelihoods:
  chronic: -11.04
  total chronic: -11.50
  deer: -12.92
  total deer: -12.11

Best fit distribution: (np.float64(-11.040868182380382), 'global_pre-VoC')
(1.58 times more likely than the global Omicron distribution)

Mutator lineage analysis:
  No mutator lineage detected
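
The transition/transversion ratio in the output above can be reproduced by hand. The following sketch is an assumption about how such a ratio is conventionally computed (purine-to-purine or pyrimidine-to-pyrimidine changes count as transitions), not a copy of the CLI's code; the helper names are hypothetical.

```python
import re

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def parse_mutations(text):
    """Split a string such as 'C241T, C3037T' into (ref, position, alt) tuples."""
    parsed = []
    for token in text.split(","):
        match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", token.strip().upper())
        if match is None or match.group(1) == match.group(3):
            raise ValueError(f"unrecognised mutation: {token!r}")
        ref, pos, alt = match.groups()
        parsed.append((ref, int(pos), alt))
    return parsed


def is_transition(ref, alt):
    """True for A<->G or C<->T changes; every other substitution is a transversion."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(mutations):
    """Transitions divided by transversions (infinite if there are no transversions)."""
    transitions = sum(is_transition(ref, alt) for ref, _, alt in mutations)
    transversions = len(mutations) - transitions
    return transitions / transversions if transversions else float("inf")


mutations = parse_mutations("C241T, C3037T, A23403G, G28881A, G28882A, G28883C")
print(f"Number of mutations: {len(mutations)}")
print(f"Transition/Transversion ratio: {ts_tv_ratio(mutations):.2f}")
```

For the example mutation list, five changes are transitions and one (G28883C) is a transversion, giving the ratio of 5.00 shown in the output.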

Full usage information can be found by running python covid_mutation_distribution/cli.py --help:

usage: cli.py [-h] [--bin-size {genes_split,gene,500,1000}] [--output {text,json}] [--plot] [--plot-output PLOT_OUTPUT] [--color-palette {plasma,viridis,inferno,seaborn}] [--verbose] mutations

SARS-CoV-2 Mutation Distribution Profiler (SMDP) CLI

positional arguments:
  mutations             Comma-separated list of mutations or path to a file containing mutations

options:
  -h, --help            show this help message and exit
  --bin-size {genes_split,gene,500,1000}
                        Bin size for analysis (default: gene)
  --output {text,json}  Output format (default: text)
  --plot                Generate a plot of mutation distribution
  --plot-output PLOT_OUTPUT
                        Output file for the plot (default: mutation_distribution.png)
  --color-palette {plasma,viridis,inferno,seaborn}
                        Color palette for the plot (default: plasma)
  --verbose             Print detailed information during analysis

Currently, the CLI only supports a single query at a time.
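
One way to work around the single-query limit is to call the CLI once per mutation set from a small wrapper script. The sketch below uses only the script path and the flags shown in the usage text above; the helper names `build_command` and `run_queries` are hypothetical, and because the structure of the JSON output is not documented here, the wrapper returns the raw standard output rather than parsing it.

```python
import subprocess
import sys

CLI_PATH = "covid_mutation_distribution/cli.py"  # path used in the example above


def build_command(mutations, bin_size="gene", output="json"):
    """Build the argument list for one CLI call (flags taken from the usage text)."""
    return [
        sys.executable, CLI_PATH,
        "--bin-size", bin_size,
        "--output", output,
        mutations,
    ]


def run_queries(queries, bin_size="gene"):
    """Run the CLI once per named mutation set and collect the raw stdout."""
    results = {}
    for name, mutations in queries.items():
        completed = subprocess.run(
            build_command(mutations, bin_size),
            capture_output=True, text=True, check=True,
        )
        results[name] = completed.stdout
    return results


# Show the command that would be run for one query (nothing is executed here).
command = build_command("C241T, C3037T, A23403G")
print(" ".join(command[1:]))
```

Calling `run_queries({"sample_a": "C241T, C3037T", "sample_b": "A23403G"})` from the repository root, inside the conda environment, would then run the two queries one after the other.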

Notes on Input

Feedback

We're pleased to accept any feedback you have. You can submit an issue in the GitHub repository here, leave comments in the Discussions tab, or email questions, comments or suggestions to erin.gill81(at)gmail.com.