Roth-Lab / pyclone

Probabilistic model for inferring clonal population structure from deep NGS sequencing.
https://bitbucket.org/aroth85/pyclone/wiki/Home

License

PyClone is free for academic/non-profit use. For commercial use please contact sshah@bccrc.ca. Consult the LICENSE.txt file for more details.

Installation

Using conda

You can install PyClone using bioconda.

conda install pyclone -c bioconda -c conda-forge

This will install PyClone into your current conda environment. In some cases it may be better to create a separate conda environment for PyClone which can be activated when needed. This avoids issues due to conflicting libraries. To create the environment, execute the following command.

conda create -n pyclone -c bioconda -c conda-forge pyclone

Once the environment is created it can be activated using the following command.

conda activate pyclone

You can check that PyClone was installed correctly by running the following command which will show the help.

PyClone --help

From source

PyClone is a standard Python package. You can find a list of dependencies in the conda recipe here. You will need to install PyDP from source as well, which can be found here.

Usage

Input format

The majority of users will use PyClone by creating a set of tab-delimited (tsv) input files, one file for each sample from the cancer. The mandatory columns of these files are as follows.

Any other columns will be ignored. Example files are found here from the mixing dataset used in the original PyClone paper.
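As a rough illustration of the one-file-per-sample layout, the sketch below writes a tab-delimited table for a single sample using only the Python standard library. Only mutation_id is named in this text; the other column names (ref_counts, var_counts, normal_cn, minor_cn, major_cn) are assumptions and should be checked against the PyClone documentation. The counts are made-up values.

```python
import csv
import io

# ASSUMPTION: only "mutation_id" is named in the text above. The remaining
# column names are placeholders for the mandatory columns and must be checked
# against the PyClone documentation before use.
ASSUMED_COLUMNS = [
    "mutation_id", "ref_counts", "var_counts",
    "normal_cn", "minor_cn", "major_cn",
]


def write_sample_tsv(handle, rows, columns=ASSUMED_COLUMNS):
    """Write one sample's mutations as a tab-delimited table with a header."""
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # silently drop keys that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up rows for one sample; the values are illustrative only.
rows = [
    {"mutation_id": "m1", "ref_counts": 900, "var_counts": 100,
     "normal_cn": 2, "minor_cn": 1, "major_cn": 1},
    {"mutation_id": "m2", "ref_counts": 750, "var_counts": 250,
     "normal_cn": 2, "minor_cn": 0, "major_cn": 2},
]

buffer = io.StringIO()
write_sample_tsv(buffer, rows)
sample_tsv = buffer.getvalue()
print(sample_tsv)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the in-memory buffer, and repeat once per sample.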

Basic usage

The easiest way to run PyClone is using the PyClone run_analysis_pipeline pipeline command. This will perform the steps to pre-process the input files, run the MCMC analysis and do the post-processing and plotting. You will need to generate the input files as specified in the previous section. You will need to pass two mandatory arguments.

Two important optional flags are:

Additional arguments are available and can be listed using PyClone run_analysis_pipeline --help.

Advanced usage

In some cases the run_analysis_pipeline pipeline command can fail. This usually happens when a large number of mutations are input into the software which causes the plotting code to fail. In this case users can semi-manually run the steps of PyClone. The commands required are:

  1. PyClone setup_analysis: This will create the correctly formatted yaml input files for the MCMC analysis. Run PyClone setup_analysis --help to see the list of arguments. They are similar to PyClone run_analysis_pipeline.

  2. PyClone run_analysis: This will run the MCMC analysis. Run PyClone run_analysis --help to see a list of supported arguments.

  3. PyClone build_table: This will post-process the MCMC trace and build a results file. Run PyClone build_table --help to see supported arguments.

There are two additional commands for plotting: PyClone plot_clusters and PyClone plot_loci. These commands are not optimized for large datasets with thousands of mutations, so they may crash or produce poor-quality plots. The best option in this case is to run PyClone build_table and write custom plotting code to show the desired result. The output tsv files can easily be loaded into Python or R for plotting.
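A minimal sketch of the first step of such custom code is shown here: it reads a results table and averages one value per cluster and sample, which can then be handed to any plotting library. The column names sample_id, cluster_id and cellular_prevalence are assumptions about the build_table output and should be checked against the header of the file actually produced. The rows are made-up values.

```python
import csv
import io
from collections import defaultdict


def mean_prevalence_by_cluster(handle):
    """Return {(cluster_id, sample_id): mean cellular prevalence}.

    ASSUMPTION: the table has columns named sample_id, cluster_id and
    cellular_prevalence. Adjust the names to match the real output file.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["cluster_id"], row["sample_id"])
        totals[key][0] += float(row["cellular_prevalence"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up table standing in for a results file; values are illustrative only.
example = io.StringIO(
    "mutation_id\tsample_id\tcluster_id\tcellular_prevalence\n"
    "m1\ts1\t0\t0.5\n"
    "m2\ts1\t0\t0.25\n"
    "m3\ts1\t1\t1.0\n"
)

summary = mean_prevalence_by_cluster(example)
for (cluster_id, sample_id), mean_value in sorted(summary.items()):
    print(cluster_id, sample_id, mean_value)
```

Aggregating per cluster before plotting keeps the number of plotted points small even when the table contains thousands of mutations.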

Common issues/mistakes

  1. Non-overlapping mutation ids between samples. PyClone will intersect the set of mutations found in the input tsv files for each sample. If no mutations are shared between the files then the analysis will fail. There are two common reasons this occurs. First, users append a sample ID to the mutation_id, e.g. mutation m1 is called m1_s1 in sample s1 and m1_s2 in sample s2. PyClone will see these as two different mutations. Second, the variant caller used fails to identify a mutation in one sample. In this case the user should manually retrieve the allele counts for the mutation in that sample and add the entry for the mutation to the sample input tsv file.

  2. Major copy number of 0. PyClone will remove mutations with major copy number of 0. The rationale is that if the malignant cells have no copies of the region overlapping the locus, the mutation cannot exist.

  3. Large input files. PyClone was initially designed for use with small deeply sequenced panels of mutations. Typically using more than a few hundred mutations will decrease the performance of the method, both in terms of run time and in terms of accuracy. To speed up the analysis use the --init_method argument and set it to connected. To improve accuracy increase the number of MCMC iterations using the --num_iters argument.
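The first two issues can be caught before running PyClone. The sketch below is not part of PyClone; it is a hypothetical pre-flight check that reports which mutation ids are shared by all samples, which are missing from each sample, and which rows have a major copy number of 0. The column name major_cn is an assumption (only mutation_id is named in this text), and the sample tables are made up.

```python
import csv
import io


def read_rows(handle):
    """Read a tab-delimited input table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def check_samples(samples, major_cn_column="major_cn"):
    """Hypothetical helper. `samples` maps sample name -> list of row dicts.

    ASSUMPTION: the major copy number column is called "major_cn".
    Returns (shared ids, ids missing per sample, ids with major CN 0 per sample).
    """
    id_sets = {
        name: {row["mutation_id"] for row in rows}
        for name, rows in samples.items()
    }
    all_ids = set().union(*id_sets.values())
    shared = set.intersection(*id_sets.values()) if id_sets else set()
    missing = {name: sorted(all_ids - ids) for name, ids in id_sets.items()}
    zero_major = {
        name: sorted(
            row["mutation_id"] for row in rows
            if int(row[major_cn_column]) == 0
        )
        for name, rows in samples.items()
    }
    return shared, missing, zero_major


# Made-up inputs: m2 was not called in s2, and m3 has major copy number 0 in s1.
s1 = read_rows(io.StringIO("mutation_id\tmajor_cn\nm1\t1\nm2\t2\nm3\t0\n"))
s2 = read_rows(io.StringIO("mutation_id\tmajor_cn\nm1\t1\nm3\t1\n"))

shared, missing, zero_major = check_samples({"s1": s1, "s2": s2})
print("shared:", sorted(shared))
print("missing:", missing)
print("major copy number 0:", zero_major)
```

An empty shared set points to the first issue, typically sample-specific suffixes on mutation_id. Entries in the missing report are the mutations whose allele counts need to be retrieved manually for that sample.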

Limitations

There are a few limitations to consider when using PyClone.

  1. Single sample analysis. Performance dramatically increases if additional samples are used. This was demonstrated in the original PyClone paper. In general, single sample analysis will yield poor performance, which will be made worse if the sequencing depth is low, as is typical of WGS or exome data. This is a general feature of the clonal inference problem and affects all tools.

  2. No tree. PyClone does not infer a clonal phylogeny, or evolutionary tree. Several methods such as citup can use the output of PyClone to reconstruct trees. Alternatively, methods such as PhyloWGS directly infer tree structures.

Versions

0.13.1

0.13.0

Most changes in this release are internal refactoring of the code and should be invisible to the user.

0.12.9

0.12.8

0.12.7

0.12.6

0.12.5

0.12.4

0.12.3

0.12.2

0.12.1

0.12.0

0.11.3

0.11.1

0.11

Older