SimPlot++ is an open-source multi-platform application designed by Stéphane Samson, Étienne Lord and Vladimir Makarenkov (Université du Québec à Montréal). It is implemented in Python. SimPlot++ produces publication-ready SimPlot and BootScan plots using 43 nucleotide and 20 amino acid distance models. Intergenic and intragenic recombination events can be identified using the Phi, χ2, NSS and Proportion tests. SimPlot++ also generates and analyzes interactive sequence similarity networks, supports multiprocessing, and provides distance calculability diagnostics.
SimPlot++ offers the following features:
SimPlot++ for Windows is available either as an executable file or as a Python script.
The Windows installer can be found on the release page. Click on the SimPlot++-x.x-win64.msi file to download it.
A requirements.txt file listing all required libraries is available in the GitHub repository. Assuming Python 3.8 or higher is installed on the machine, the script should run once these libraries are installed.
Here is an example of how to run the script in Windows:
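1. Go to the directory containing main.py.
2. Create a virtual environment: python3 -m venv SimPlot++_env
3. Activate the virtual environment: SimPlot++_env/Scripts/Activate.ps1
4. Install the required libraries: pip install -r requirements.txt
5. Run the script: python3 main.py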
For Linux/UNIX and Mac OS, SimPlot++ is available as a Python script.
A requirements.txt file listing all required libraries is available in the GitHub repository. Assuming Python 3.8 or higher is installed on the machine, the script should run once these libraries are installed.
Here is an example of how to run the script in Linux/UNIX or Mac OS:
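1. Go to the directory containing main.py.
2. If virtualenv is not installed, run python3 -m pip install --user virtualenv
3. Create a virtual environment: python3 -m venv SimPlot++_env
4. Activate the virtual environment: source SimPlot++_env/bin/activate
5. Install the required libraries: python3 -m pip install -r requirements.txt
6. Run the script: python3 main.py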
The group page allows users to load multiple sequence files (in the Fasta, Nexus, Pir, Phylip, Stockholm or Clustal format) and to manually organize individual sequences into groups. Sequences must be aligned before they are loaded into SimPlot++. For each group created by the user, SimPlot++ generates a consensus sequence. The % threshold for consensus sequences can be modified in the preference tab of the menu. The consensus groups are essential for the SimPlot, BootScan and Network Similarity analyses. The groups can be saved in a Nexus file to avoid recreating them every time.
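To illustrate how a consensus sequence can be derived from a group of aligned sequences, here is a minimal sketch of a majority rule with a percentage threshold. The function name, the use of N as the ambiguity symbol and the rule itself are assumptions made for this example; SimPlot++ may compute its consensus differently.

```python
from collections import Counter

def consensus(sequences, threshold=0.5, ambiguous="N"):
    """Return a majority-rule consensus of equally long aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned (same length)")
    result = []
    for column in zip(*sequences):
        residue, count = Counter(column).most_common(1)[0]
        # Keep the most frequent residue only if it reaches the threshold;
        # otherwise mark the position as ambiguous (an assumption of this sketch).
        result.append(residue if count / len(column) >= threshold else ambiguous)
    return "".join(result)

group = ["ACGTA", "ACGTT", "ACCTG"]
print(consensus(group, threshold=0.6))
```

Raising the threshold makes the consensus stricter, so more positions become ambiguous.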
Below is a summary of the steps presented:
A SimPlot analysis slides a window of a specified size, advancing by a specified step, over the Multiple Sequence Alignment (MSA). Every sub-MSA covered by the window is extracted, and a distance matrix is computed from it using the selected distance model. This distance matrix is then used to produce a similarity plot for every consensus sequence against the reference sequence chosen by the user. Variations in similarity between the reference sequence and the consensus sequences can be used, for example, to detect potential recombination events.
Window length: Determines the size of the sliding window
Step: Determines the size of the advancement step of the sliding window (with overlap)
Strip gap: Determines the maximum number of ambiguous positions allowed in a consensus sequence (optional)
Multiprocessing: Allows multiple windows to be analyzed simultaneously. Recommended for large datasets
Distance Model: 43 DNA models and 20 amino acid models available
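The sliding-window idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: similarity is the plain proportion of identical, ungapped sites rather than one of the distance models offered by SimPlot++, and the function name and defaults are hypothetical.

```python
def window_similarity(reference, query, window=200, step=20):
    """Yield (window midpoint, similarity) pairs along two aligned sequences."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned (same length)")
    for start in range(0, len(reference) - window + 1, step):
        ref_win = reference[start:start + window]
        qry_win = query[start:start + window]
        # Compare only positions where neither sequence has a gap.
        pairs = [(a, b) for a, b in zip(ref_win, qry_win) if a != "-" and b != "-"]
        if not pairs:
            similarity = None  # nothing to compare in this window
        else:
            matches = sum(a == b for a, b in pairs)
            similarity = matches / len(pairs)
        yield start + window // 2, similarity

ref = "ACGTACGTACGT"
qry = "ACGTACGAACGA"
for midpoint, sim in window_similarity(ref, qry, window=4, step=4):
    print(midpoint, sim)
```

With a step smaller than the window length, consecutive windows overlap, which smooths the resulting curve.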
SimPlot++ adds a quality control feature to the SimPlot analysis that lets the user visualize gap-related data (such as the regions where the number of gaps in the consensus sequence exceeded the permitted gap threshold).
Overall data completeness (whether genetic distances were successfully computed or not in each window) can also be viewed through this feature.
Since certain models can produce errors when computing distances between genetically distant sequences (divisions by zero, logarithms of negative numbers, etc.), it is recommended to use this feature to check whether such issues occurred. If they did, it is recommended to modify the analysis parameters, consensus groups, consensus threshold, or the distance model used. As a general rule, simpler models tend to be more lenient.
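The classic Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3), gives a concrete example of why a distance may not be computable: once the proportion p of differing sites reaches 0.75, the logarithm is undefined. The sketch below is an illustration of that failure mode, not the SimPlot++ implementation; the function name is hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Return the JC69 distance for a proportion p of differing sites,
    or None when the distance is not computable."""
    argument = 1.0 - (4.0 / 3.0) * p
    if argument <= 0.0:
        return None  # log of a non-positive number: sequences too divergent
    return -0.75 * math.log(argument)

for p in (0.10, 0.50, 0.80):
    print(p, jukes_cantor_distance(p))
```

A window in which the function returns None is the kind of region the quality control view is meant to reveal.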
The sequence similarity network analysis is an interactive representation of a SimPlot analysis using a window in which every group (including the reference group) is represented by a network node. These nodes are connected by an edge depending on the calculated global (over the whole sequence) or local (over sub-sequences of a selected length) similarity.
By adjusting the minimum similarity threshold required to show each edge type (global and local), it is possible to gain better insight into the relationships between the groups. Furthermore, the network similarity representation can be limited to a specific range of the full MSA (in order to analyze a gene or region of interest).
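The effect of a minimum similarity threshold on the displayed edges can be sketched as a simple filter. The similarity values, group names and function name below are hypothetical and serve only to illustrate the idea.

```python
def build_edges(similarities, min_similarity):
    """Keep only the group pairs whose similarity reaches the threshold."""
    return sorted(pair for pair, value in similarities.items() if value >= min_similarity)

# Hypothetical global similarities between consensus groups.
global_similarity = {
    ("A", "B"): 0.95,
    ("A", "C"): 0.80,
    ("B", "C"): 0.60,
}
print(build_edges(global_similarity, 0.75))
```

Lowering the threshold to 0.5 in this example would also connect groups B and C.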
Datatables allow the user to view the most important similarity regions between the network nodes (i.e. MSA sequences or sub-sequences). The results of the Proportion test are also available in a datatable (discussed in more detail in the statistical methods section).
The graph data and visualization can be saved in an HTML file. The graph itself can be saved as either a .png or .svg (the option is in the user preference page) directly from the toolbox in the HTML file.
Bootscanning is a pipeline consisting of 4 main steps, all done using a sliding window analysis (as in the SimPlot analysis).
Bootstrap: Number of replicates to be generated for each sub-MSA (each position of the sliding window)
Tree model: Neighbor-Joining or UPGMA
Window length: Size of the sliding window
Step: Sliding window advancement step
Distance Model: 43 DNA substitution models are available
Multiprocessing: Allows multiple windows to be analyzed simultaneously (recommended for large datasets)
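To make the Bootstrap parameter concrete, here is a minimal sketch of the standard way bootstrap replicates are drawn from an alignment: columns of the sub-MSA are resampled with replacement. It is assumed, not confirmed by this document, that SimPlot++ follows this scheme; the function name is hypothetical and the tree-building step is omitted.

```python
import random

def bootstrap_replicates(sub_msa, n_replicates, seed=None):
    """Return n_replicates alignments built by resampling columns with replacement."""
    rng = random.Random(seed)
    length = len(sub_msa[0])
    replicates = []
    for _ in range(n_replicates):
        # Draw column indices with replacement, then rebuild every sequence.
        columns = [rng.randrange(length) for _ in range(length)]
        replicates.append(["".join(seq[i] for i in columns) for seq in sub_msa])
    return replicates

window = ["ACGTAC", "ACGTTC", "TCGAAC"]
for replicate in bootstrap_replicates(window, n_replicates=2, seed=1):
    print(replicate)
```

Each replicate keeps the number and length of the sequences, so the same distance model and tree method can be applied to every one of them.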
The FindSites scan is used to locate possible regions of recombination by identifying informative sites. The first step of the analysis is to select a sequence assumed to have originated from a recombination event, two sequences of interest (one from each of the two possible parental evolutionary lines), and a fourth sequence as an outgroup. Informative sites are those where, at the same position, two of the sequences share the same nucleotide and the other two sequences share another (different) nucleotide.
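The definition of an informative site given above can be sketched as follows. The function and variable names are hypothetical, and skipping gapped columns is an assumption of this example.

```python
def informative_sites(query, parent_a, parent_b, outgroup):
    """Return 0-based positions where the four sequences split two against two."""
    sites = []
    for position, column in enumerate(zip(query, parent_a, parent_b, outgroup)):
        if "-" in column:
            continue  # skip gapped columns (an assumption of this sketch)
        counts = {base: column.count(base) for base in set(column)}
        # Exactly two distinct nucleotides, each carried by two sequences.
        if sorted(counts.values()) == [2, 2]:
            sites.append(position)
    return sites

query    = "AACGT"
parent_a = "AACTT"
parent_b = "GACGA"
outgroup = "GATTA"
print(informative_sites(query, parent_a, parent_b, outgroup))
```

In this toy alignment, positions 0 and 4 group the query with the first parent, while position 3 groups it with the second.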
Statistical tests for detecting recombination events from the PhiPack package by Trevor Bruen et al. have been implemented in SimPlot++. The multiprocessing option is recommended for faster computation.
The Phi, Phi-profile, Max χ2 and NSS tests are available for both the ungrouped (raw sequences) and grouped consensus sequences.
Additional information on these tests can be found here.
Moreover, a new, simple Proportion test has been designed as a complement to the traditional SimPlot analysis in order to quickly identify the most likely mosaic regions (i.e. possible recombination events) in the grouped sequences. This test is based on the proportion of genetic distances extracted from the SimPlot distance matrices. The Proportion score is an indicator of signal strength, but it should not always be interpreted as a recombination signal.
Please email us at samson.stephane.3@courrier.uqam.ca or makarenkov.vladimir@uqam.ca with any questions or feedback.