SwiftOrtho is orthology analysis tool which identifies orthologs, paralogs and co-orthologs for genomes. It is a graph-based approach.
Make sure that you have the following installed
Python 3.7 or later (Recommend Anaconda ).
The GNU Compiler Collection GCC 6.0 or above.
MCL(optional).
$git clone https://github.com/Rinoahu/SwiftOrtho
$cd SwiftOrtho
$bash ./install.sh
$cd example
$bash ./run.sh
1. All-to-all homologous search:
$python SwiftOrtho/bin/find_hit.py -p blastp -i input.fsa -d input.fsa -o input.fsa.sc -e 1e-5 -s 111111
-i|-d: protein sequences in fasta format. The identifier of each protein sequence in intput.fsa should be like this: >xxx|yyyy where xxx is the taxon code and yyyy is a sequence identifier. For example:
>A|a1
MENIHDLWERALAEMEKKVSKPSYETWLKSTKANDIQNDVITITAPNEFARDWLEEHYAG
LTSDTIEHLTGARLTPRFVIPQNELEDDFLIEPPKKKKPVSDNGSQNNGTKTMLNDKYTF
...
>A|b2
MQFTIQRDRFVHDVQNVAKAVSSRTTIPILTGIKIVADHEGVTLTGSDSDVSIETFIPKE
...
-o: output file. A tabular text file which contains 14 columns. The first 12 columns are the same as blastp -m8 format, the last 2 columns are the lengths of the query and target sequences. For example:
A|a1 A|a1 100.00 450 0 0 1 450 1 450 2.88e-261 897 450 450
A|a1 B|b2 53.52 340 158 0 111 450 240 579 1.60e-105 380 450 583
...
-e: expect value.
-s: space seed pattern.
2. Orthology inference:
$python SwiftOrtho/bin/find_orth.py -i input.fsa.sc -c 0.5 -y 0 > input.fsa.sc.orth
-i: input file. It is the output file of step 1 or blast -m8.
-c: threshold of alignment coverage of pairwise sequences.
-y: threshold of alignment identity of pairwise sequences.
-s: separator between taxa and sequence id. Default is |.
>: output file. It is a tabular text file that contains 4 columns in the following format:
OT A|a1 B|b1 1.33510402833
IP A|a1 A|a2 1.23374340949
CO A|a2 B|b2 1.41459539212
...
Col 1: orthology relationship, one of OT(ortholog), CO(co-ortholog), or IP(in-paralog).
Col 2: identifier of gene in species A.
Col 3: identifier of gene in species B.
Col 4: weight of orthology relationship.
3. Cluster orthology relationships into groups:
SwiftOrtho implements two cluster algorithms: Markov Clustering, and Affinity Aropagation in Python. Here is an example of how to use it.
# use MCL
$python SwiftOrtho/bin/find_cluster.py -i input.fsa.sc.orth -a mcl -I 1.5 > input.fsa.sc.orth.mcl
# use APC
$python SwiftOrtho/bin/find_cluster.py -i input.fsa.sc.orth -a apc -I 1.5 > input.fsa.sc.orth.apc
-i: input file. Input file contains all the orthology relationships. For example:
OT A|a1 B|b1 1.33510402833
IP A|a1 A|a2 1.23374340949
CO A|a2 B|b2 1.41459539212
...
-a: algorithm to cluster. [mcl|apc].
-I: inflation parameter only for mcl.
>: output file. This file contains several rows. Each row stands for an orthologous group. In each row, there are gene identifiers of same or different species. For example:
A|a1 A|a2 B|b1 C|c1 D|d1
A|a3 B|b3 C|c3
A|a4 B|b4
A|a5 A|a6
...
If users want to use the original MCL for clustering, they can follow the steps below:
$cut -f2-4 input.fsa.sc.orth > input.fsa.sc.orth.xyz
$mcl input.fsa.sc.orth.xyz --abc -I 1.5 -o input.fsa.sc.orth.mcl -te 4
If the MCL cannot run due to insufficent memory, users can try my MCL_lite project which is another memory-efficent implementation of Markov Clustering Algorithm. The MCL_lite project is available at https://github.com/Rinoahu/MCL_lite.
The merge.py in scripts directory is used to merge multiple fasta files into a single one and add a label to gene identifier for each species.
Usage: $python scripts/merge.py dir_name > merge.fasta
dir_name is the directory that contains all the fasta files.
Run_all.py in directory scripts automatically implements the following steps:
Requirement:
Usage:
$python run_all.py -i test.fsa -p test.fsa.operon -a 4
-i: input file. protein sequences in fasta format. The identifier of each protein sequence in intput.fsa should be like this: >xxx|yyyy where xxx is the taxon code and yyyy is a sequence identifier.
-p: operonic annotation file. The 1st column of this file should be like x0-->x1-->x2-->x3 or x0<--x1<--x2<--x3. x# stand for gene identifiers and <-- or --> stands for strand of gene. For example:
A|a0-->A|a1 unknown-->COG1607 unknown::unknown-->I::Acyl-CoA hydrolase::Lipid transport and metabolism
B|b0<--B|b1 COG4644<--COG1961 X::Transposase and inactivated derivatives, TnpA family::Mobilome: prophages, transposons<--L::Site-specific DNA recombinase related to the DNA invertase Pin::Replication, recombination and repair
-a: number of multi-thread.
Results: Several files are generated:
test.fsa.sc
Results of all-vs-all homology search. See example above in Usage.
test.fsa.aln.trim
Concatenation of trimmed aligned protein sequences of conserved genes.
test.fsa.nwk
Species phylogenetic tree in newick format. the tree is constructed from the aligned protein sequences of the conserved gene.
test.fsa.opc
Orthology relationships. See the example above in Usage.
test.fsa.mcl
Orthologous groups. See the example above in Usage.
test.fsa.operon.mcl
Grouped operons which reflect conservation of operons across multiple species. This file contains several rows. Each row contains operonic information from the same or different species. For example:
A1-->A2-->A3 B1<--B2<--B3 C1<--C2<--C3
A4<--A5<--A6 B4<--B5<--B6 C4<--C5<--C6
A7-->A8 B7<--B8<--B9
....
test.fsa.pan
main features of pan-genome. For example:
# Statistics and profile of pan-genome:
# The methods can be found in Hu X, et al. Trajectory and genomic determinants of fungal-pathogen speciation and host adaptation.
#
# statistic of core, shared and specific genes:
# Feature core shared specific taxon
# Number 27 2117 9766 5
#
# ω(core size of pan-genome) and 95% confidence interval:
# κc τc ω
# 18001.747907101293±97986.86937584748 0.4604747552601067±0.5879003578202601 29.071595667457963±45.51565446328978
#
# θ(new gene number for everay new genome sequenced) and 95% confidence interval:
# κs τs tg(θ)
# 1334.0072284367752±2342.5492209911768 2.2743910535524314±9.701652708550565 1952.605831944348±1311.6323603805986
#
# κ(size and openess of pan-genome, open if γ > 0) and 95% confidence interval:
# κ γ
# 2899.5570049130965±179.58438208536737 0.8785342365438822±0.04423040927927408
#
# Type and frequency of each gene group in different species:
################################################################################
#family type GCF_000005825.2_ASM582v2 GCF_000006645.1_ASM664v1 GCF_000006605.1_ASM660v1 GCF_000005845.2_ASM584v2 GC
F_000006625.1_ASM662v1
group_000000000 Share 0 1 0 1 0
group_000000001 Specific 2 0 0 0 0
group_000000002 Specific 2 0 0 0 0
group_000000003 Specific 2 0 0 0 0
group_000000004 Specific 3 0 0 0 0
...
To cite our work, please refer to:
SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier. Xiao Hu, Iddo Friedberg. GigaScience, Volume 8, Issue 10, giz118, https://doi.org/10.1093/gigascience/giz118