RiversDong / GageTracker

tool for dating gene age by micro- and macro-synteny with high speed and accuracy
10 stars 4 forks source link

0. Items to note in advance

Please ensure that the chromosome IDs in your GTF annotation files are consistent with those in your genome FASTA files. This is a critical step to avoid any mismatches that could lead to errors in subsequent analysis or processing steps.

1. About GageTracker

GageTracker is a python package for dating gene age by micro- and macro-syteny with high speed and accuracy. It's can:

2. Installation

2.1. Dependencies

All the dependencies (listed in the following table) should be pre-installed. The users need to add all the corresponding executable programs to the environmental path before running GageTracker for dating gene age. Software Links
python Python >=3.8
last https://gitlab.com/mcfrith/last
tantan https://gitlab.com/mcfrith/tantan
bedtools https://github.com/arq5x/bedtools2/releases
maf2synteny https://github.com/fenderglass/maf2synteny
windowmasker https://github.com/goeckslab/WindowMasker
BLAST toolkit ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
bedtools (v2.29.2) https://github.com/arq5x/bedtools2/releases

WindowMasker is a subprogram of BLAST+, so it does not need to be installed separately. And the UCSC tools are also needed, please downloaded these tools from http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.

axtChain, axtSort, axtToMaf, axtToPsl, chainCleaner,
chainFilter, chainNet, chainPreNet, chainSort, chainStitchId,     
chainSwap, chainToPsl, faSize, faToTwoBit, mafToPsl,
netChainSubset, netSyntenic, netToAxt, pslToChain

Please note that all dependencies must be added to the environment path before using GageTracker!!

2.2. Download the latest released GageTracker version from github:

  git clone https://github.com/RiversDong/GageTracker.git
  cd GageTracker && chmod 755 ./*
  export PATH=$PATH:/path/to/GageTracker

/path/to/GageTracker is the path of unziped GageTracker file that downloaded by git clone https://github.com/RiversDong/GageTracker.git

2.3. Install package gtfparse, pandas, and biopython by typing the following command lines

pip install gtfparse==1.2.1
pip install pandas==1.4.3
pip install biopython==1.79

3. Usage

3.1. Prepare the users defined control file

There are 9 necessary parameters in the control file (ctl), which are used to designate the input annotation of target species, output path, the main branches and the reference genome list of each branch. To illustrate how to prepare the control file, we used the gene dating task of Drosophila melanogaster (D.melanogaster) as a case. In this example, there is a total including 7 branches (from branch 0 to branch 6), and 12 Drosophila (Figure 1 from Zhang Y E et al. Genome research, 2010, 20(11): 1526-1533). The detailed parameters are listed in table 1, which totally contains three columns, the first is the required parameters, the second is an example to show how to set this parameter, and the final one is an explanation of the corresponding parameter.

Tree

The following table gives the detailed explanations for each of the parameters in user’s control file, in which the first column lists the necessary parameters, the second column lists the example for what value should be given and the final one gives its corresponding description.

Parameters

Example

Explanations

branch

branch = ["br6", "br5", "br4", "br3", "br2", "br1", "br0"]

The parameter branch is a python list type, which is used for designating the branch name. Please label the branch name from old to young. In this example br6 represents the youngest branch and br0 represents the oldest branch.

old_br

old_br = ["br0"]

The parameter is a python list type, which is used for designating the old branch.

protein_seq

protein_seq = "/path/to/fasta/dmel_protein.fasta"

(Target protein sequence) The parameter is a python string type. This parameter specifies the path of the longest protein sequence of each gene (stored in fasta format). Please note that the sequence names should be the same as their corresponding genes and the sequences are represented by the longest protein sequences. For example, gene g encodes two proteins: p1 and p2, among which p2 is the longest one, then the sequence should be organized as:

>g

The protein sequence of p2

outpath

outpath = "/path/to/dating

(output path) The parameter is a python string type. This parameter specifies the output of temporary files and gene age files

target

target = "/path/to/fasta/dmel.fasta"

(Target genome) The parameter is a python string type. This parameter specifies the genome of our target species that we want to data gene age.

annotation

annotation = "/path/to/fasta/dmel.gtf"

(Target annotation) The parameter is a python string type. This parameter specifies the gene annotation of our target species (in gtf format). We have tested the annotation from Ensembl and NCBI. All of these annotations work well. Note: the GTF file should contain these features: gene, exon, and CDS in the third column.

reference

reference["br0"] = ["/path/to/fasta/dvir.fasta", "/path/to/fasta/dmoj.fasta", "/path/to/fasta/dgri.fasta"]

reference["br1"] = ["/path/to/fasta/dwil.fasta"]

reference["br2"] = ["/path/to/fasta/ dper.fasta", "/path/to/fasta/dpse.fasta"]

reference["br3"] = ["/path/to/fasta/dana.fasta"]

reference["br4"] = ["/path/to/fasta/dere.fasta", "/path/to/fasta/dyak.fasta"]

reference["br5"] = ["/path/to/fasta/dsec.fasta", "/path/to/fasta/dsim.fasta"]

(Out-group species genomes) The parameter is a python dictionary type. This parameter designates the branch and its corresponding genome path. For example, branch br4 contains two species (Figure 1) dere and dyak, which are stored in /path/to/fasta/. Therefore, the pair of key and values between br4 and its corresponding species should be written as reference["br4"] = ["/path/to/fasta/dere.fasta", "/path/to/fasta/dyak.fasta"]. Please note that reference parameter should only contain the genome of outgroup species excluding the genome of our target species.

age

age = "dmel.age"

The parameter is a python string type. This parameter specifies the file for storing gene age, which will be stored in the path designated by output parameter. In this example, the final output is /path/to/outpath/dmel.age.

voting

voting = 0.5

The parameter is a python int type, which is used to determine whether a gene is present or absent in outgroup species(default is 0.5).

 

3.2. Running the program after preparing the user-defined control file

The basic command

GageTracker example.ctl [options]
    Options:
        -h, -help   show all options and their default settings, and exit
        -V         -version show version information, and exit
        -mos        mask the outgroup species using windowmasker (default: NOT mask the outgroup species)
        -lg         align the large genome using lastal5 (default: lastal)
        -p          running the program in multi-processes. By default, GageTracker uses 6 threads when calling lastal. The -p parameter specifies the number of processes for genome alignment, with each process representing a pair of genome alignment tasks and utilizing 6 threads during alignment. Therefore, when GageTracker is run with -p 6, it initiates 6 processes for genome alignment, each using 6 threads, resulting in a total of 36 threads (6 * 6 = 36) actively performing genome alignment.
        -ao or -step1   only performs alignments and get the rbh alignment according to the outgroup species. 
        -rbh or -step2  calculate the rBH regions by cosindering the genome alignment.
        -da or -step3   only performs gene dating according to the ctl files.
        -m          infer the originating mechanism for young genes (based on the BLASTp alignments)

Several examples for running the command line

# Dating gene age without masking outgroups genomes (for genome size <1 Gb)
GageTracker dm.ctl

# Dating gene age and infer originating mechanism with 12 processes
GageTracker dm.ctl -m -p 12

# Dating gene age without masking outgroups genomes (for genome size >1 Gb)
GageTracker zebrafish.ctl -lg -p 2 

# Dating gene age with masking outgroups genomes (for large genome size ~3 Gb)
GageTracker human.ctl -lg -p 2 -mos 

# Only run the genome alignment with 5 processes
GageTracker dm.ctl -ao -p 5

# Only run the genome alignment with 5 processes and mask the outgroup species
GageTracker dm.ctl -ao -mos -p 5

# Only run the genome alignment with 5 processes and mask the outgroup species, and also handling with aligning with large genomes
GageTracker dm.ctl -ao -mos -lg -p 5

# Get the RBH alignments based on the genome alignments with 5 processes
GageTracker dm.ctl -rbh -p 5
or
GageTracker dm.ctl -step2 -p 5

# Get the gene age based on the results from the previous two steps (genome alignment and RBH results) with 5 processes
GageTracker dm.ctl -da -p 5
or
GageTracker dm.ctl -step3 -p 5

4. TIPs

4.1. TIP1: add new whole genome alignment as reference

We have provided a toolkit (addaln), which allow users add a new reference genome without performing additional alignments that have done. Just typing the following, a new RBH alignment will be done

addaln -add /home/chuand/new_gene/virilis/fasta/dbusckii.fasta -br B1 -ct dvirilis.ctl
-add: specify the genome path that user wants to add
-br: specify the branch that the newly add genome belong to
-ct: tell the path of control.

After this procedure is done, add the following to user’s previous control file

reference[“B1”] = [/home/chuand/new_gene/virilis/fasta/dbusckii.fasta]

Then, perform the age dating procedure (the third stage) by typing the following, Noting that dvirilis.ctl contains the newly added reference species and the updated branch list.

GageTracker dvirilis.ctl -da -p 5
or
GageTracker dvirilis.ctl -step3 -p 5

4.2. Tip2: get the gene age of different gene type

We provide a tool, gage_diff.py (get gene age of different annotation type), to filter gene age according to user’s needed. Just type the gene type according to the prompt message from the tool. Bellowing is an example

gage_diff.py test.ctl

You can select the following gene type:
1       tRNA
2       rRNA
3       transcribed_pseudogene
4       pseudogene
5       snRNA
6       lncRNA
7       guide_RNA
8       snoRNA
9       protein_coding

Select the gene type listed above: lncRNA

Your output is stored in the outpath that specifies in ctl.

The above example will give the gene age list of lncRNA, which stored in the path specified by ctl file. In the same way, users can also choose the age of protein-coding genes from the annotation results. Noting that the results is filtered by repeat sequences ratio among the total exon length of a gene.

5. Output

The output contains four key columns (Confidence, Branch, Chromosome and GeneMaskRatio) and several supplementary columns. In “Confidence”, CON means the alignment is not detected in sequencing gaps and NCON means the alignment is detected in sequencing gaps, therefore such genes marked by NCON should be considered as unreliable, which means that such genes are deemed as young genes not because it can not be found in out group species, but because of the sequencing quality.

6. Some error messages and solutions

Bug fix

References

  1. Zhang YE, Vibranovski MD, Krinsky BH et al. Age-dependent chromosomal distribution of male-biased genes in Drosophila, Genome Res 2010;20:1526-1533.
  2. Shao Y, Chen C, Shen H, et al. GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes[J]. Genome research, 2019, 29(4): 682-696.
  3. Fang CC#, Dong C# et al. GageTracker: a tool for dating gene age by micro- and macro-synteny with high speed and accuracy (unpublished article)
  4. Dong C, Zhang L, Xia S, et al. New gene evolution with subcellular expression patterns detected in PacBio-sequenced genomes of Drosophila genus[J]. bioRxiv, 2022: 2022.11. 30.518489.