TS404 / AlignStat

A tool for the statistical comparison of alternative multiple sequence alignments
http://AlignStat.science.latrobe.edu.au

AlignStat: A tool for the statistical comparison of alternative multiple sequence alignments

Thomas M A Shafee, Ira R Cooke

Department of Biochemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Australia
College of Science, Health and Engineering, La Trobe University, Melbourne, Australia
Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative, Melbourne, Australia

Resources

Online webtool: AlignStat.science.latrobe.edu.au
On CRAN: CRAN.r-project.org/web/packages/AlignStat
Publication: Shafee and Cooke, BMC Bioinformatics 2016 17:434

Description

This package contains functions that compare two alternative multiple sequence alignments (MSAs) to determine how well they align homologous residues in the same columns as one another. It classifies similarities and differences into conserved sequence, conserved gaps, splits, merges and shifts. Summarising these categories for each column yields information on which columns are agreed upon by both MSAs, and which differ. Output graphs visualise the comparison data for analysis.

Contains functions

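compare_alignments            Pairwise alignment of alignments
plot_similarity_heatmap       Heatmap of similarities between alignment columns
plot_dissimilarity_matrix     Heatmap of the dissimilarity categories for each residue
plot_similarity_summary       Summary plot of alignment similarities
plot_dissimilarity_summary    Detailed plot of alignment differences
plot_SP_summary               Summary plot of sum of pairs scores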

Installation

From CRAN

install.packages("AlignStat")

From GitHub

install.packages("devtools")
devtools::install_github("TS404/AlignStat")
library("AlignStat")

compare_alignments

Compare alternative multiple sequence alignments

Description

This function aligns two multiple sequence alignments (MSAs) against one another. The alternative alignments must contain the same sequences, in any order. Fasta, clustal, msf, phylip or mase formats are accepted. The function classifies any similarities and differences between the two MSAs.

It produces the "pairwise alignment comparison" object required as the first step for any of the other package functions.

Usage

compare_alignments (ref, com, SP=FALSE, CS=FALSE)

Arguments

ref   The reference MSA (in fasta, clustal, msf, phylip or mase format)
com   The MSA to compare (in fasta, clustal, msf, phylip or mase format)
SP    Also calculate sum of pairs and related scores (default = FALSE)
CS    Also calculate the column score and related scores (default = FALSE)

Value

Generates an object of class "pairwise alignment comparison" (PAC), providing the optimal pairwise column alignment of two alternative MSAs of the same sequences, and summary statistics of the differences between them. The details of the PAC output components are as follows:

reference_P           A numbered character matrix of the reference alignment
comparison_Q          A numbered character matrix of the comparison alignment
results_R             A matrix whose [i,j]th entry is the ith match category average of the
                      jth column of the reference alignment versus the comparison alignment
                      (i1=match, i2=conserved gap, i3=merge, i4=split, i5=shift). Used to
                      generate the similarity summary and dissimilarity summary plots.
similarity_S          A similarity matrix whose [i,j]th entry is the similarity score between
                      the ith column of the reference alignment and the jth column of the
                      comparison alignment. Used to determine which columns are most similar
                      for further analysis. Used to generate the similarity heatmap plot.
dissimilarity_D       A dissimilarity matrix whose [i,j,k]th entry is the kth match category
                      of the jth residue of the ith sequence for the reference alignment
                      versus the comparison alignment (k1=match, k2=conserved gap, k3=merge,
                      k4=split, k5=shift).
dissimilarity_simple  A matrix whose [i,j]th entry is the dissimilarity category of the jth
                      residue of the ith sequence for the reference alignment versus the
                      comparison alignment (M=match, g=conserved gap, m=merge, s=split, x=shift).
                      Generated from the dissimilarity matrix with categories stacked into a
                      single 2D matrix. Used to generate the dissimilarity matrix plot.
columnmatch           The column of the comparison alignment with the highest final match score
cys                   The proportion of cysteines (relevant for cysteine-rich proteins)
reflen                The number of columns in the reference alignment
comlen                The number of columns in the comparison alignment
refcon                The consensus sequence of the reference alignment
comcon                The consensus sequence of the comparison alignment
similarity_score      The overall similarity score between the reference and comparison alignments
column_score          The number and location of columns with 100% identity
sum_of_pairs          The sum of pairs score and related data (optional)
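
The relationship between the per-residue categories (dissimilarity_simple) and the per-column category averages (results_R) can be illustrated on a toy matrix in base R. This is only a sketch under stated assumptions: the matrix `simple`, the vector `categories` and the object `results` are hypothetical stand-ins built by hand, not output of compare_alignments, and the package may compute its averages differently.

```r
# Toy per-residue category matrix (rows = sequences, columns = reference
# columns), using the codes listed for dissimilarity_simple:
# M = match, g = conserved gap, m = merge, s = split, x = shift.
simple <- rbind(c("M", "M", "g", "s"),
                c("M", "m", "g", "M"),
                c("M", "M", "x", "M"),
                c("g", "M", "x", "M"))

# Hypothetical lookup from readable category names to the one-letter codes
categories <- c(match = "M", gap = "g", merge = "m", split = "s", shift = "x")

# Rows = categories, columns = alignment columns, entries = proportion of
# sequences falling in that category (the layout described for results_R).
results <- sapply(seq_len(ncol(simple)), function(j) {
  sapply(categories, function(k) mean(simple[, j] == k))
})

print(round(results, 2))
```

Each column of `results` sums to 1, because every residue position is assigned exactly one category.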

Details

The compare_alignments function compares two alternative multiple sequence alignments (MSAs) of the same sequences. The alternative alignments must contain the same sequences, in any order. The function classifies similarities and differences between the two MSAs. It produces the "pairwise alignment comparison" object required as the first step for any of the other package functions.

The function converts the MSAs into matrices of sequence characters labelled by their occurrence number in the sequence (e.g. to distinguish between the first and second cysteines of a sequence). It then compares the two MSAs to determine which columns have the highest similarity between the reference and comparison MSAs, generating a similarity matrix (excluding conserved gaps). From this matrix, the comparison alignment column with the highest similarity to each reference alignment column is used to calculate further statistics for the dissimilarity matrix, which are summarised for each reference MSA column in the results matrix. Lastly, it calculates the overall similarity score between the two MSAs.
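
The labelling and column-matching steps can be sketched in base R on a toy pair of alignments. This is an illustration rather than the package's implementation: the helpers `label_occurrences` and `column_similarity` are hypothetical names, and the similarity used here (shared labelled residues divided by the non-gap characters of the reference column) is an assumed simplification of the score described above.

```r
# Toy alignments: the same three sequences aligned two ways ("-" is a gap).
ref <- rbind(c("A", "C", "-", "C"),
             c("A", "C", "G", "C"),
             c("-", "C", "G", "C"))
com <- rbind(c("A", "-", "C", "C"),
             c("A", "C", "G", "C"),
             c("-", "C", "G", "C"))

# Label each residue with its occurrence number within its own sequence,
# so the first and second "C" of a sequence become "C1" and "C2".
label_occurrences <- function(msa) {
  t(apply(msa, 1, function(s) {
    out <- s
    res <- s != "-"
    # ave() numbers repeated residues in order of appearance
    out[res] <- paste0(s[res], ave(seq_along(s[res]), s[res], FUN = seq_along))
    out
  }))
}

# Similarity between every reference column i and comparison column j
column_similarity <- function(P, Q) {
  S <- matrix(0, ncol(P), ncol(Q))
  for (i in seq_len(ncol(P))) {
    for (j in seq_len(ncol(Q))) {
      # Same labelled residue in the same sequence row, gaps excluded
      shared <- P[, i] == Q[, j] & P[, i] != "-"
      S[i, j] <- sum(shared) / max(1, sum(P[, i] != "-"))
    }
  }
  S
}

P <- label_occurrences(ref)
Q <- label_occurrences(com)
S <- column_similarity(P, Q)

# For each reference column, the best-matching comparison column
columnmatch <- apply(S, 1, which.max)
print(round(S, 2))
print(columnmatch)
```

In this toy case the second reference column shares only two of its three residues with its best-matching comparison column, because the first sequence's first cysteine has been moved to the third comparison column.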

Example

data("reference_alignment")
data("comparison_alignment")
PAC <- compare_alignments(reference_alignment,comparison_alignment)

plot_similarity_heatmap

A heatmap plot of the column identities between two multiple sequence alignments

Usage

plot_similarity_heatmap (x, scale=TRUE, display=TRUE)

Arguments

x          an object of type "pairwise alignment comparison"
           (typically the summary file generated by compare_alignments)
scale      scale data to proportion of characters that are not conserved gaps (default = TRUE)
display    display this plot (default = TRUE)

Details

The plot_similarity_heatmap function displays the similarity between each pairwise column comparison for the reference and comparison MSAs. Colour density is determined by the proportion of identical character matches between the columns, normalised to the number of characters that are not merely conserved gaps. This gives a representation of which columns are well agreed upon by the MSAs, and which columns are split by one MSA relative to the other.

Example

plot_similarity_heatmap (PAC)

plot_dissimilarity_matrix

A heatmap plot of the dissimilarity matrix of two multiple sequence alignments

Usage

plot_dissimilarity_matrix (x, display=TRUE)

Arguments

x          an object of type "pairwise alignment comparison" 
           (typically the summary file generated by compare_alignments)
display    display this plot (default = TRUE)

Details

The plot_dissimilarity_matrix function displays the dissimilarity categories for all characters in the reference alignment. This gives a representation of which columns are well agreed upon by the MSAs, and which sequence regions of the reference alignment are split, merged, or shifted.

Example

data(reference_alignment)
data(comparison_alignment)
PAC <- compare_alignments(reference_alignment,comparison_alignment)
plot_dissimilarity_matrix(PAC)

plot_similarity_summary

A line plot summary of column similarity between two multiple sequence alignments

Usage

plot_similarity_summary (x, scale=TRUE, CS=FALSE, cys=FALSE, display=TRUE)

Arguments

x          an object of type "pairwise alignment comparison" 
           (typically the summary file generated by compare_alignments)
scale      scale data to proportion of characters that are not conserved gaps (default = TRUE)
CS         additionally indicate columns with 100% identity using markers on the x-axis (default = FALSE)
cys        additionally show the cysteine abundance for each column (default = FALSE)
display    display this plot (default = TRUE)

Details

The plot_similarity_summary function generates a plot that summarises the similarity between the two multiple sequence alignments for each column of the reference alignment. For each column, it plots the identical character matches as a proportion of the characters that are not merely conserved gaps. The overall average proportion of identical characters that are not conserved gaps is overlaid as a percentage. For alignments of cysteine-rich proteins, the cysteine abundance for each column may also be plotted to indicate columns containing conserved cysteines (cys=TRUE).
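
The effect of scaling to characters that are not conserved gaps can be shown with a small arithmetic sketch. The vectors `match_prop` and `gap_prop` are hypothetical per-column values, and the formula is an assumption about what the scale option amounts to, not code taken from the package.

```r
# Hypothetical per-column proportions for four reference columns:
# the share of sequences that match, and the share that are conserved gaps.
match_prop <- c(0.75, 0.75, 0.00, 0.75)
gap_prop   <- c(0.25, 0.00, 0.50, 0.00)

# Scaled similarity: matches as a proportion of the characters that are
# not conserved gaps. Columns made up entirely of conserved gaps get NA.
scaled <- ifelse(gap_prop < 1, match_prop / (1 - gap_prop), NA)

# Overall proportion of matching characters among all characters that are
# not conserved gaps (every column has the same number of sequences).
overall <- sum(match_prop) / sum(1 - gap_prop)

print(scaled)
print(round(100 * overall, 1))
```

The first column illustrates the difference: three of four sequences match and the fourth is a conserved gap, so the unscaled value is 0.75 while the scaled value is 1.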

Example

plot_similarity_summary (PAC, CS=TRUE, cys=TRUE)

plot_dissimilarity_summary

An area plot summary of the different causes of column dissimilarity between two multiple sequence alignments

Usage

plot_dissimilarity_summary (x, scale=TRUE, stack=TRUE, display=TRUE)

Arguments

x          an object of type "pairwise alignment comparison" 
           (typically the summary file generated by compare_alignments)
scale      scale data to proportion of characters that are not conserved gaps (default = TRUE)
stack      stacked area plot instead of line plot (default = TRUE)
display    display this plot (default = TRUE)

Details

The plot_dissimilarity_summary function generates a detailed breakdown of the differences between the multiple sequence alignments for each column of the reference alignment. For each column, the relative proportions of merges, splits and shifts are plotted as a proportion of characters that are not merely conserved gaps.

Example

plot_dissimilarity_summary (PAC)

plot_SP_summary

A line plot summary of sum of pairs score between two multiple sequence alignments

Usage

plot_SP_summary (x, CS=TRUE, display=TRUE)

Arguments

x          an object of type "pairwise alignment comparison" 
           (typically the summary file generated by compare_alignments)
CS         indicate columns with 100% identity using markers on the x-axis (default = TRUE)
display    display this plot (default = TRUE)

Details

The plot_SP_summary function generates a plot that summarises the columnwise sums of pairs for the two multiple sequence alignments. For each column of the comparison alignment, it plots the conserved residue pairs as a proportion of the possible residue pairs. The overall sum of pairs score, reverse sum of pairs score, and column score are also reported as percentages.
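
For readers unfamiliar with sum of pairs scoring, the idea can be sketched in base R. This sketch assumes the conventional definitions (the sum of pairs score as the fraction of residue pairs aligned in the reference that are also aligned in the comparison, and the reverse score as the same fraction taken the other way round); the helper `aligned_pairs` is a hypothetical name and the package's own calculation may differ in detail.

```r
# Toy alignments: the same three sequences aligned two ways ("-" is a gap).
ref <- rbind(c("A", "C", "-", "C"),
             c("A", "C", "G", "C"),
             c("-", "C", "G", "C"))
com <- rbind(c("A", "-", "C", "C"),
             c("A", "C", "G", "C"),
             c("-", "C", "G", "C"))

# Encode every aligned residue pair as "seqA:resA|seqB:resB", where res is
# the residue's position in its ungapped sequence.
aligned_pairs <- function(msa) {
  # Residue index of each cell within its own (ungapped) sequence
  idx <- t(apply(msa != "-", 1, cumsum))
  pairs <- character(0)
  for (col in seq_len(ncol(msa))) {
    rows <- which(msa[, col] != "-")
    if (length(rows) < 2) next   # no pairs in this column
    cmb <- combn(rows, 2)        # all pairs of sequences in the column
    pairs <- c(pairs, paste0(cmb[1, ], ":", idx[cmb[1, ], col], "|",
                             cmb[2, ], ":", idx[cmb[2, ], col]))
  }
  pairs
}

ref_pairs <- aligned_pairs(ref)
com_pairs <- aligned_pairs(com)

# Fraction of reference pairs that are also aligned in the comparison
sp <- mean(ref_pairs %in% com_pairs)
# Fraction of comparison pairs that are also aligned in the reference
reverse_sp <- mean(com_pairs %in% ref_pairs)

print(c(SP = sp, reverse_SP = reverse_sp))
```

Here each alignment contains eight aligned residue pairs, six of which are shared, so both scores come to 0.75.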

Example

PAC <- compare_alignments(reference_alignment, comparison_alignment, SP=TRUE, CS=TRUE)
plot_SP_summary(PAC)

Full example workflow

# Example data loading
data("reference_alignment")
data("comparison_alignment")
# Alignment comparison calculation
PAC <- compare_alignments  (reference_alignment, comparison_alignment, CS=TRUE, SP=TRUE)
# Results visualisation
plot_similarity_heatmap    (PAC)
plot_dissimilarity_matrix  (PAC)
plot_similarity_summary    (PAC, CS=TRUE, cys=TRUE)
plot_dissimilarity_summary (PAC, stack=TRUE)
plot_SP_summary            (PAC, CS=TRUE)