Please ensure that the chromosome IDs in your GTF annotation files are consistent with those in your genome FASTA files. This is a critical step to avoid any mismatches that could lead to errors in subsequent analysis or processing steps.
GageTracker is a python package for dating gene age by micro- and macro-syteny with high speed and accuracy. It's can:
All the dependencies (listed in the following table) should be pre-installed. The users need to add all the corresponding executable programs to the environmental path before running GageTracker for dating gene age. | Software | Links |
---|---|---|
python | Python >=3.8 | |
last | https://gitlab.com/mcfrith/last | |
tantan | https://gitlab.com/mcfrith/tantan | |
bedtools | https://github.com/arq5x/bedtools2/releases | |
maf2synteny | https://github.com/fenderglass/maf2synteny | |
BLAST toolkit | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ | |
bedtools (v2.29.2) | https://github.com/arq5x/bedtools2/releases |
WindowMasker is a subprogram of BLAST+, so it does not need to be installed separately. And the UCSC tools are also needed, please downloaded these tools from http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.
axtChain, axtSort, axtToMaf, axtToPsl, chainCleaner,
chainFilter, chainNet, chainPreNet, chainSort, chainStitchId,
chainSwap, chainToPsl, faSize, faToTwoBit, mafToPsl,
netChainSubset, netSyntenic, netToAxt, pslToChain
Please note that all dependencies must be added to the environment path before using GageTracker!!
git clone https://github.com/RiversDong/GageTracker.git
cd GageTracker && chmod 755 ./*
export PATH=$PATH:/path/to/GageTracker
/path/to/GageTracker is the path of unziped GageTracker file that downloaded by git clone https://github.com/RiversDong/GageTracker.git
pip install gtfparse==1.2.1
pip install pandas==1.4.3
pip install biopython==1.79
There are 9 necessary parameters in the control file (ctl), which are used to designate the input annotation of target species, output path, the main branches and the reference genome list of each branch. To illustrate how to prepare the control file, we used the gene dating task of Drosophila melanogaster (D.melanogaster) as a case. In this example, there is a total including 7 branches (from branch 0 to branch 6), and 12 Drosophila (Figure 1 from Zhang Y E et al. Genome research, 2010, 20(11): 1526-1533). The detailed parameters are listed in table 1, which totally contains three columns, the first is the required parameters, the second is an example to show how to set this parameter, and the final one is an explanation of the corresponding parameter.
The following table gives the detailed explanations for each of the parameters in user’s control file, in which the first column lists the necessary parameters, the second column lists the example for what value should be given and the final one gives its corresponding description.
Parameters |
Example |
Explanations |
branch |
branch = ["br6", "br5", "br4", "br3", "br2", "br1", "br0"] |
The parameter “branch” is a python list type, which is used for designating the branch name. Please label the branch name from old to young. In this example br6 represents the youngest branch and br0 represents the oldest branch. |
old_br |
old_br = ["br0"] |
The parameter is a python list type, which is used for designating the old branch. |
protein_seq |
protein_seq = "/path/to/fasta/dmel_protein.fasta" |
(Target protein sequence) The parameter is a python string type. This parameter specifies the path of the longest protein sequence of each gene (stored in fasta format). Please note that the sequence names should be the same as their corresponding genes and the sequences are represented by the longest protein sequences. For example, gene g encodes two proteins: p1 and p2, among which p2 is the longest one, then the sequence should be organized as: >g The protein sequence of p2 |
outpath |
outpath = "/path/to/dating |
(output path) The parameter is a python string type. This parameter specifies the output of temporary files and gene age files |
target |
target = "/path/to/fasta/dmel.fasta" |
(Target genome) The parameter is a python string type. This parameter specifies the genome of our target species that we want to data gene age. |
annotation |
annotation = "/path/to/fasta/dmel.gtf" |
(Target annotation) The parameter is a python string type. This parameter specifies the gene annotation of our target species (in gtf format). We have tested the annotation from Ensembl and NCBI. All of these annotations work well. Note: the GTF file should contain these features: gene, exon, and CDS in the third column. |
reference |
reference["br0"] = ["/path/to/fasta/dvir.fasta", "/path/to/fasta/dmoj.fasta", "/path/to/fasta/dgri.fasta"] reference["br1"] = ["/path/to/fasta/dwil.fasta"] reference["br2"] = ["/path/to/fasta/ dper.fasta", "/path/to/fasta/dpse.fasta"] reference["br3"] = ["/path/to/fasta/dana.fasta"] reference["br4"] = ["/path/to/fasta/dere.fasta", "/path/to/fasta/dyak.fasta"] reference["br5"] = ["/path/to/fasta/dsec.fasta", "/path/to/fasta/dsim.fasta"] |
(Out-group species genomes) The parameter is a python dictionary type. This parameter designates the branch and its corresponding genome path. For example, branch br4 contains two species (Figure 1) dere and dyak, which are stored in /path/to/fasta/. Therefore, the pair of key and values between br4 and its corresponding species should be written as reference["br4"] = ["/path/to/fasta/dere.fasta", "/path/to/fasta/dyak.fasta"]. Please note that reference parameter should only contain the genome of outgroup species excluding the genome of our target species. |
age |
age = "dmel.age" |
The parameter is a python string type. This parameter specifies the file for storing gene age, which will be stored in the path designated by output parameter. In this example, the final output is /path/to/outpath/dmel.age. |
voting |
voting = 0.5 |
The parameter is a python int type, which is used to determine whether a gene is present or absent in outgroup species(default is 0.5). |
GageTracker example.ctl [options]
Options:
-h, -help show all options and their default settings, and exit
-V -version show version information, and exit
-mos mask the outgroup species using windowmasker (default: NOT mask the outgroup species)
-lg align the large genome using lastal5 (default: lastal)
-p running the program in multi-processes. By default, GageTracker uses 6 threads when calling lastal. The -p parameter specifies the number of processes for genome alignment, with each process representing a pair of genome alignment tasks and utilizing 6 threads during alignment. Therefore, when GageTracker is run with -p 6, it initiates 6 processes for genome alignment, each using 6 threads, resulting in a total of 36 threads (6 * 6 = 36) actively performing genome alignment.
-ao or -step1 only performs alignments and get the rbh alignment according to the outgroup species.
-rbh or -step2 calculate the rBH regions by cosindering the genome alignment.
-da or -step3 only performs gene dating according to the ctl files.
-m infer the originating mechanism for young genes (based on the BLASTp alignments)
# Dating gene age without masking outgroups genomes (for genome size <1 Gb)
GageTracker dm.ctl
# Dating gene age and infer originating mechanism with 12 processes
GageTracker dm.ctl -m -p 12
# Dating gene age without masking outgroups genomes (for genome size >1 Gb)
GageTracker zebrafish.ctl -lg -p 2
# Dating gene age with masking outgroups genomes (for large genome size ~3 Gb)
GageTracker human.ctl -lg -p 2 -mos
# Only run the genome alignment with 5 processes
GageTracker dm.ctl -ao -p 5
# Only run the genome alignment with 5 processes and mask the outgroup species
GageTracker dm.ctl -ao -mos -p 5
# Only run the genome alignment with 5 processes and mask the outgroup species, and also handling with aligning with large genomes
GageTracker dm.ctl -ao -mos -lg -p 5
# Get the RBH alignments based on the genome alignments with 5 processes
GageTracker dm.ctl -rbh -p 5
or
GageTracker dm.ctl -step2 -p 5
# Get the gene age based on the results from the previous two steps (genome alignment and RBH results) with 5 processes
GageTracker dm.ctl -da -p 5
or
GageTracker dm.ctl -step3 -p 5
We have provided a toolkit (addaln), which allow users add a new reference genome without performing additional alignments that have done. Just typing the following, a new RBH alignment will be done
addaln -add /home/chuand/new_gene/virilis/fasta/dbusckii.fasta -br B1 -ct dvirilis.ctl
-add: specify the genome path that user wants to add
-br: specify the branch that the newly add genome belong to
-ct: tell the path of control.
After this procedure is done, add the following to user’s previous control file
reference[“B1”] = [/home/chuand/new_gene/virilis/fasta/dbusckii.fasta]
Then, perform the age dating procedure (the third stage) by typing the following, Noting that dvirilis.ctl contains the newly added reference species and the updated branch list.
GageTracker dvirilis.ctl -da -p 5
or
GageTracker dvirilis.ctl -step3 -p 5
We provide a tool, gage_diff.py (get gene age of different annotation type), to filter gene age according to user’s needed. Just type the gene type according to the prompt message from the tool. Bellowing is an example
gage_diff.py test.ctl
You can select the following gene type:
1 tRNA
2 rRNA
3 transcribed_pseudogene
4 pseudogene
5 snRNA
6 lncRNA
7 guide_RNA
8 snoRNA
9 protein_coding
Select the gene type listed above: lncRNA
Your output is stored in the outpath that specifies in ctl.
The above example will give the gene age list of lncRNA, which stored in the path specified by ctl file. In the same way, users can also choose the age of protein-coding genes from the annotation results. Noting that the results is filtered by repeat sequences ratio among the total exon length of a gene.
The output contains four key columns (Confidence, Branch, Chromosome and GeneMaskRatio) and several supplementary columns. In “Confidence”, CON means the alignment is not detected in sequencing gaps and NCON means the alignment is detected in sequencing gaps, therefore such genes marked by NCON should be considered as unreliable, which means that such genes are deemed as young genes not because it can not be found in out group species, but because of the sequencing quality.
ERROR NetFilterNonNested.perl (comes with the chaincleaner source code) is not a binary in sPATH, Either install it or provide the nets as input
Please download NetFilterNonNested.perl from https://github.com/hillerlab/GenomeAlignmentTools/blob/master/src/NetFilterNonNested.perl, then move this script to your PATH and grant it executable permissions.
gene.bed file is empty
Check the GTF file to ensure it includes the "gene" feature. If the file lacks the "gene" feature, GageTracker won't be able to extract position information, as it relies on the presence of the "gene" keyword in the feature columns.