arzwa / whaleprep

Nextflow pipeline for preparatory analyses for Whale

how to use whaleprep? #1

Open ShuaiNIEgithub opened 4 years ago

ShuaiNIEgithub commented 4 years ago

Whale looks like a great tool for studying WGDs and the evolution of gene duplication and loss rates. However, it is hard for me to install whaleprep, so could you give me some advice? Should I install nextflow, python3, PRANK, MrBayes and ALEobserve? And how do I use whaleprep on my platform?

arzwa commented 4 years ago

Hi, thanks for your interest in Whale. To use the pipeline in this repository you will need python3, PRANK, MrBayes and ALEobserve, as you mentioned. I would recommend following a brief tutorial on nextflow to get a basic understanding of that framework. You can then have a look at the nextflow.conf file to see whether you need to change anything for the pipeline to work on your system (currently the configuration is for an SGE-type computing cluster). Lastly, to run the analyses, you just need to run nextflow whaleprep.nf. I usually use a script like this:

# configuration
## working directory
export NXF_WORK=/home/arzwa/whaleprep-workdir
## path to whaleprep directory
whaleprep="/home/arzwa/whaleprep/"
## path to a directory with protein fasta files for each gene family
fastadir="./fasta"

nextflow $whaleprep/whaleprep.nf --fasta $fastadir 
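
The executor and resource settings live in the nextflow.conf file mentioned above. Purely as a sketch of what an adapted configuration could look like (the queue name and resource values below are placeholders, not necessarily what ships with this repository):

// minimal configuration sketch; 'all.q' and the resource values are
// placeholders: adapt them to your own cluster, or set the executor
// to 'local' to run everything on a single machine
process {
    executor = 'sge'        // or 'local', 'slurm', ...
    queue    = 'all.q'
    cpus     = 1
    memory   = '4 GB'
    time     = '12h'
}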

Note that this repository is mainly available for the sake of reproducibility (these are the methods I used for our MBE paper). As you probably know, you could use many other software tools for the same purposes, and I do not particularly recommend the approach I took here over any other. So, to be clear and to make sure you don't feel restricted to the particular tools used in this pipeline, I'll briefly indicate what you need to do in order to perform evolutionary inference with Whale:

  1. Get gene families for your set of species of interest (e.g. using OrthoFinder)
  2. Get an alignment for each gene family (e.g. using PRANK, MUSCLE, MAFFT, ...)
  3. Get a sample from the posterior distribution of phylogenetic trees for every family (e.g. using MrBayes, RevBayes or BEAST). Alternatively, you can use bootstrap replicates (e.g. computed with IQ-TREE or RAxML), although this is less theoretically justified. The output of this step should be a file with a sample of trees (say 10000) for each family.
  4. Get the conditional clade distribution (CCD), using ALEobserve (a rough command-line sketch of steps 2-4 follows below this list).
  5. Run Whale using the resulting CCD (.ale) files
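
To make this a bit more concrete, below is a rough command-line sketch of steps 2-4 for a single family. Treat it as illustrative only: the file names (fam.fasta, fam.nex, fam.treesample), the substitution model and the MCMC settings are placeholders rather than the exact settings used in this pipeline.

# step 2: align the protein sequences (PRANK here; MAFFT/MUSCLE work too)
prank -d=fam.fasta -o=fam            # writes the alignment to fam.best.fas

# step 3: sample gene trees with MrBayes; this assumes fam.nex contains the
# alignment in NEXUS format plus a mrbayes block along the lines of:
#   begin mrbayes;
#     set autoclose=yes nowarn=yes;
#     prset aamodelpr=fixed(wag);
#     lset rates=gamma;
#     mcmcp ngen=110000 samplefreq=10;
#     mcmc;
#   end;
mb fam.nex                           # tree samples end up in fam.nex.run*.t

# convert the sampled trees to plain newick, one tree per line, discarding
# burn-in (e.g. with a short python3 script); call the result fam.treesample

# step 4: compute the conditional clade distribution with ALEobserve
ALEobserve fam.treesample            # writes fam.treesample.ale

The resulting .ale files are what you feed to Whale in step 5.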

This pipeline performs steps 2-4 using a particular set of tools, but again, you can use any set of phylogenetics programs that suit the task. Note that, given the computationally intensive nature of both the preparatory steps and Whale itself, I do not recommend including more than about 10 to 15 species, and I do recommend running most analyses on subsets of ~1000 random gene families.
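
Subsampling families is easy to do on the command line; a minimal sketch, assuming one protein fasta file per family with a .fasta extension in the ./fasta directory from the script above:

# copy 1000 randomly chosen gene family files to a separate directory
# ('shuf' is part of GNU coreutils)
mkdir -p fasta-subset
ls fasta/*.fasta | shuf -n 1000 | xargs -I{} cp {} fasta-subset/

You can then point the --fasta option of the pipeline at fasta-subset instead of the full directory.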

I hope this makes things somewhat clearer?

ShuaiNIEgithub commented 4 years ago

Excellent advice! That is exactly what I wanted to hear, thank you so much! Also, please give me a minute: I am writing up another question about Whale, and I would appreciate your suggestions on that as well.