CarineRey / pcoc

Convergent substitution detection tool based on the PCOC model
GNU General Public License v3.0
30 stars 16 forks source link

PCOC

Build Status

(For full explanation of the pipeline, see the PCOC paper).

PCOC is a convergent substitution detection tool with two main facets:


Table of contents:


I. Using PCOC to detect convergence in an empirical dataset

1. Prerequisite on input data

The user needs:

Once you have these data, you can install PCOC.

If you don't have your data yet, you can use the dataset from (Besnard et al., 2009) used in the PCOC paper and courteously provided by the authors:

Besnard, G., Muasya, A. M., Russier, F., Roalson, E. H., Salamin, N., and Christin, P.-A. 2009. Phylogenomics of C4 Photosynthesis in Sedges (Cyperaceae): Multiple Appearances and Genetic Convergence. Molecular Biology and Evolution, 26(8): 1909–1919, https://doi.org/10.1093/molbev/msp103

#get the PEPC protein alignment for sedges (plant species at C3/C4 transition), and rename this alignment with shorter name
wget  https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.aa.coor_mays.fa
mv cyp_coding.aa.coor_mays.fa ali.fa

#get the corresponding tree, rename it and move it to a new folder
mkdir -p tree_dir
wget https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.phy_phyml_tree.txt -P tree_dir
mv tree_dir/cyp_coding.phy_phyml_tree.txt tree_dir/tree.nw

Of note: the gene tree needs to have the exact same labels (same number of labels, same names) as the alignment.

2. PCOC Installation

Docker

PCOC relies on the Bpp suite (https://github.com/BioPP/bppsuite) which is in constant development. In order to avoid installation and compilation of the latest version on your machine (this can take a long time), we suggest you to use Docker. Of course, you can also use PCOC without Docker but Docker is the easiest way to use the PCOC toolkit locally. Docker will create a local environment on your computer that will contain PCOC dependencies (Bpp and python with some modules (ete3 and Biopython)).They will all be packaged within the PCOC Docker image.

If you don't have Docker on your machine, get it here first. (Be aware that installation might differ if you're a Linux, a Mac or a Windows user.)

To download or update the docker image you have to type the following command line:

docker pull carinerey/pcoc

Note: Mac users may first need to start Docker using the Docker application (if it's not running in background). If not, you could get the error message Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.

Then, to use PCOC through a docker container (containing the PCOC image), you have to type:

# run as user (to have user permission on output files) ##NOT WORKING
docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc [SOME PCOC TOOL] [SOME PCOC OPTIONS]

What does this command line mean?

Below, we will shorten this command line by passing all the Docker options into an environment variable.

CMD_PCOC_DOCKER="docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc"

Thus, to run PCOC, simply type:

$CMD_PCOC_DOCKER [SOME PCOC TOOL] [SOME PCOC OPTIONS]

Singularity

PCOC can also be run under Singularity, a container system designed specifically for use on shared compute clusters. By default, Singularity containers run as the current user, not root, so the environment settings and current directory mappings are not necessary.

Unlike docker, Singularity "pull" will create an image in the current working directory that is referenced directly.

singularity pull docker://carinerey/pcoc
singularity exec ./pcoc_latest.sif [SOME PCOC TOOL] [SOME PCOC OPTIONS]

3. Usage

A. Prepare tree for detection analysis: set putative convergent leaves/nodes

Based on biological relevance of the studied taxon, each user determines manually where the convergent transitions occurred, and which nodes are assumed to be in the convergent phenotypic state. This is done with the help of the pcoc_num_tree.py script.

Note that PCOC aims to detect sites with convergent substitutions in the case of a given set of transitions but not to identify the set of convergent transitions.

i. The 1st step is to "number" the tree: each node in the input tree is labeled and given a number.

$CMD_PCOC_DOCKER pcoc_num_tree.py -t tree_dir/tree.nw -o num_tree.pdf ##takes a few seconds; replace file name of tree.nw with your input tree

ii. In the 2nd step, you use the numbered tree (output of the 1st step) to write a string corresponding to the convergent scenario you are interested in.

The string looks like 1,2,3/55,56/67. The syntax is defined as such:

In the example dataset used in the PCOC paper, species with the convergent phenotypes are (see the figure in num_tree.pdf for the node IDs):

1st convergent clade 2nd convergent clade 3rd convergent clade 4th convergent clade 5th convergent clade
Ele.bald (node 0) Bulbostyl (38) Fimb.li2 (45) Cyp.capi (78) Rhy.rubr (133)
Ele.bal2 (node 1) all internal nodes in this group Fimb.di2 (46) Volkiell (79) Rhy.glob (134)
Ele.bal4 (node 3) Fimb.fer (47) Cyp.ust2 (80) Rhy.glo2 (135)
Ele.vivi (node 5) all internal nodes in this group Remirea (81) all internal nodes in this group
Ele.vivA (node 6) Cyp.iria (82)
all internal nodes in this group (nodes 2, 4, 7 and 8) Killinga (83)
Pycreus (84)
Cyp.long (86)
Cyp.rotu (87)
Cyp.papy (89)
Cyp.ustu (90)
all internal nodes in this group

so the string corresponding to this scenario is:

8,0,1,2,3,4,5,6,7/49,45,46,47,48/38/98,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97/137,133,134,135,136.

(Remember that the common ancestor of each clade with a convergent transition has to be the first in the list.)

iii. (Optional) The 3rd step is to visualize the final tree with colored labeled branches (to ascertain that no branch has been forgotten during 2nd step)

# this takes a few seconds
scenario="8,0,1,2,3,4,5,6,7/49,45,46,47,48/38/98,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97/137,133,134,135,136"
$CMD_PCOC_DOCKER pcoc_num_tree.py -t tree_dir/tree.nw -o num_tree.pdf -m $scenario

B. Run detection analysis

This is done by the pcoc_det.py script. Please see the man page for the full list of options: $CMD_PCOC_DOCKER pcoc_det.py -h

Minimal usage:

To run the detection tool of the PCOC toolkit, you need the amino-acid alignment of the studied gene, the "species" tree with values for branch lengths, and the string corresponding to the convergent scenario (obtained from the previous section).

With the option -f, you can fix the threshold at which a position is retained if its posterior probability is above this threshold for one of the 3 implemented models (PCOC, PC or OC).

# here we set the threshold to 0.8
# this takes one minute maximum
$CMD_PCOC_DOCKER pcoc_det.py -t tree_dir/tree.nw -aa ali.fa -o output_pcoc_det -m $scenario -f 0.8

4. Output files

Results are in the folder output_pcoc_det/RUN_yyyymmdd_hhmmss/ where yyyymmdd_hhmmss corresponds to the date and time of the analysis.

You will find:

Of note: if you want to get the alignment of filtered positions only, add the --no_cleanup_fasta option to your command. This will create a directory named fasta with these particular positions for each model (PCOC, PC and OC) and the union of them. You will most probably only use the output from the PCOC model, but the PC and OC outputs are also provided.

Graphical output options:

5. Using simulation to calculate the statistical power of PCOC on your dataset

Once you got the candidate convergent sites for your gene, you may want to known the sensitivity (True Positive Rate; TPR) and the specificity or maybe the opposite, the False Positive Rate (FPR) of PCOC in your own dataset.

For that, you can use the PCOC simulation and benchmark part using the same tree and same parameters that were used for detection. It will perform simulations of a large number of sites with convergent evolution, and of sites without convergent evolution and provide the amount of true positives and false negatives.

$CMD_PCOC_DOCKER pcoc_sim.py -t  tree_dir/ -o  output_pcoc_det_sim -m $scenario -c 5  -n_sc 1 --pcoc -nb_sampled_couple 10 -n_sites 100

(For more details about pcoc_sim.py options see the next section.)

In brief, for 10 random couple of profiles (-nb_sampled_couple 10), pcoc_sim.py will simulate two alignments of 100 sites (-n_sites 100):

Then, pcoc_sim.py will detect convergent evolution in both alignments.

For each couple of ancestral and convergent profiles, the convergent evolution scenario will contain the 5 convergent transitions (-c 5) defined in $scenario.

You can add --ident or --topo to test the Topological and Identical methods (see PCOC paper for more details).

In the output directory, you fill find a tabular file Tree_1/BenchmarkResults.tsv, which contains the number of FP and TP for each method used, for each couple of profiles and for different thresholds.

You have just to average all the lines for a given method and a given threshold to get your TPR and FPR.

For example, we have:

In our example, if we take a threshold equal to 0.99 and as we used n_sites_simulations=100, you should found something like :

Then, to get the expected number of FP in your dataset (FP_ali), you have to calculate the expected number of positive and negative sites (P_ali and F_ali). If you take, as in the PCOC paper, an expectd proportion of 2% of convergent sites in your data, you will have:

This figure (2%) is extracted from the paper of Thomas and Hahn (2015) where they found 140 truly convergent sites among 6000 sites.

And so, you will have:

Finally, you should find a very low False Discovery Rate (FDR):

II. Using PCOC to simulate a dataset and compare methods for detecting convergent evolution on it

If you don't have PCOC yet, see the installation procedure in the first section

PCOC allows users to simulate a dataset and compare PCOC with other convergence detection tools (using the simulated dataset). This PCOC feature can be used in 2 main situations:

In both cases, the options will be exactly the same. The only thing that differs between these approaches is the definition of the convergence scenario.

(i) In the first situation, you provide a candidate convergence scenario (relevant to the biology of the studied species). In practice, you have a species tree with given convergent events and you want to test the power of convergent detection tools on this particular topology.

(ii) In the second situation, the convergence scenario will be randomly generated. You have one (or more) species tree(s) and you want to compare the power of PCOC with other convergence detection tools under random convergent scenarios on this (these) topologies.

1. Usage

The PCOC simulation and benchmark part is done by the pcoc_sim.py script.

(see man page for full list of options: $CMD_PCOC_DOCKER pcoc_sim.py -h)

If you don't have a species tree yet, you can use the dataset from [Besnard et al., 2009](https://doi.org/10.1093/molbev/msp103) used in the PCOC paper and courteously provided by the authors:

#get the corresponding tree, rename it and move it to a new folder
mkdir -p tree_dir
wget https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.phy_phyml_tree.txt -P tree_dir
mv tree_dir/cyp_coding.phy_phyml_tree.txt tree_dir/tree.nw

The other species trees used in the PCOC paper are available here: https://github.com/CarineRey/pcoc/tree/master/data/sim/tree_dir

A typical PCOC simulation and detection pipeline is composed of three main steps:

Thus, PCOC command-lines will be divided into "blocks" that correspond to the different steps in this pipeline. Prior to each analysis, you will have to define the directory containing your tree(s) (-td option) and the output directory (-o option).

$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim [REST OF OPTIONS]

Below, we'll go through all the three "steps"... step by step.

A. Define the convergent scenarios

In both situations (user-defined or random scenario), PCOC will randomly choose ONE number of convergent events between a minimum (c_min) and a maximum number (c_max). If you use -c, the number of convergent events will be fixed to -c. (c can still be used in combination with c_max to pick -c convergent events among -c_max events.)

To run PCOC multiple times and test different scenarios, use the -n_sc option. For instance, to repeat the test 10 times:

 pcoc_sim.py [...]  -n_sc 10  [...]

PCOC can not modify the species tree topology but can modify the branch lengths. You can:

B. Simulate the data according to the convergent scenario

Once PCOC has placed the convergent events on the species tree, it will use the tree to generate:

Of note: the length of the alignments can be set with -n_sites.

Sequence generation is based on the evolution of sequences from the root to the leaves of the tree according to amino-acid profiles (i. e. sets of amino-acid frequencies). PCOC uses the C60 profiles (containing 60 amino-acid profiles) defined in (Si Quang et al, 2009). Among those 60 profiles, 1 couple is picked and the first profile is qualified as "ancestor" and the second as "convergent".

In the convergent scenario, PCOC "traverses" the tree starting from the root with the ancestor profile. Once it reaches a node with a convergent transition, PCOC switches from the ancestor to the convergent profile. The other nodes keep the ancestral profile. In the non-convergent scenario, the ancestor profile is applied to all nodes.

Alternatively, you can use the C10 profiles (containing 10 amino-acid profiles) instead of C60 using -CATX_sim 10.

By default, only one couple of profiles is used for a scenario but you can use the -nb_sampled_couple to increase this number. This will generate multiple sets of independent alignments corresponding to the same convergent scenario.

To increase the realism of the simulation, you can add some noise in the simulated data. You can either:

C. Test for convergence in the simulated data using different methods (including PCOC)

Finally, pcoc_sim.py can use and compare 3 methods to detect convergence in the simulated dataset (see PCOC paper for more details):

For instance, to compare the PCOC method and the topological method only, use

pcoc_sim.py [...]  -pcoc -topo #do not include -ident

By default, C10 amino-acid profiles are used in the estimation part instead of the C60 amino-acid profiles in the simulation part. Alternatively, you can use the C60 profiles instead of C10 using -CATX_est 60. Note that -CATX_est is different from -CATX_sim used in the simulation part.

2. Output files

To have an example output, you can use the following command line adapted from the example (see below).

$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 1 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -c 5 --pcoc --ident --topo

Results are in the folder output_pcoc_sim/RUN_yyyymmdd_hhmmss/ where yyyymmdd_hhmmss corresponds to the date and time of the analysis.

You will find:

3. Some examples taken from the PCOC paper

In this section, some command-lines used in the PCOC paper will be used as examples but option values have been changed to reduce the computational time.

The 2nd figure of the paper addresses the impact of:

  1. the number of convergent events on the ability to detect convergence.
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60
  1. the impact of branch lengths on the ability to detect convergence.
bl_l="0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1"
for bl in $bl_l
do
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 32 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c 5 -c_max 7 -cpu 8 --pcoc --ident --topo  -CATX_est 10 -CATX_sim 60 -bl_new $bl
done

Some figures of the supplementary data address the robustness of the PCOC models towards some noisy simulated dataset.

  1. noise in the alignment (for example Supplementary figure S8)
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60 --ali_noise
  1. noise add in the branch length after simulate the alignment (for example Supplementary figure S9)
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60 --bl_noise

Have fun with !!