Kevin-Haigis-Lab / speclet

A Bayesian hierarchical model to discover tissue-specific cancer driver genes and synthetic lethal interactions from CRISPR/Cas9 LoF screens.
GNU General Public License v3.0

speclet - A Bayesian hierarchical model to discover tissue-specific cancer driver genes and synthetic lethal interactions from CRISPR/Cas9 LoF screens

License: GPLv3

[Figure: speclet model diagram]

The speclet model accounts for cell line- and chromosome-specific differences while simultaneously measuring the effect of targeting each gene across multiple molecular covariates including copy number, mRNA expression, and mutation status. The effect of the presence of mutations to key driver and tumor suppressor genes is also included to identify putative synthetic lethal interactions. The results of this project have been published in Chapter 4 of my Ph.D. dissertation available here: "Studying the tissue-specificity of cancer driver genes through KRAS and genetic dependency screens" (link to come soon).


Setup

Many setup and running commands have been added as make commands. Run make help to see the options available.

Python virtual environments

There are two 'conda' environments for this project: the first speclet for modeling and analysis, the second speclet_smk for the pipelines. They can be created using the following commands. Here, we use 'mamba' as a drop-in replacement for 'conda' to speed up the installation process.

conda install -n base -c conda-forge mamba
mamba env create -f conda.yaml
mamba env create -f conda_smk.yaml

Either environment can then be used like a normal 'conda' environment. For example, below is the command to activate the speclet environment.

conda activate speclet

Alternatively, the above commands can be accomplished using the make pyenvs command.

# Same as above.
make pyenvs

On O2, because I don't have control over the base conda environment, I follow the incantations below for each environment:

conda create -n speclet --yes -c conda-forge python=3.9 mamba
conda activate speclet && mamba env update --name speclet --file conda.yaml

In addition to that fun, there is also a problem with installing Python 3.10 on the installed version of conda, so I find I need to instead install 3.9 and then let the mamba install step update it.

GPU

Some additions to the environment are needed to use a GPU for sampling from posterior distributions with the JAX backend in PyMC. Instructions are provided in the JAX GitHub repo and the PyMC repo. First, the cuda and cudnn libraries need to be installed. Second, a specific distribution of jax should be installed. At the time of writing, the following commands work, but I would recommend consulting those two sources if doing this again in the future.

mamba install --yes -c nvidia "cuda>=11.1" "cudnn>=8.2"
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

These commands have been added to the Makefile under the command make gpu. Use the same commands with the speclet_smk environment active to be able to use the GPU in the pipelines.
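
After installing the GPU libraries, it can be useful to confirm which backend JAX will actually use. The following is a minimal sketch and not part of 'speclet'; the helper name jax_backend is hypothetical, and it relies only on jax.default_backend().

```python
from __future__ import annotations


def jax_backend() -> str | None:
    """Return the name of JAX's default backend, or None if JAX is not installed."""
    try:
        import jax
    except ImportError:
        return None
    # Reports the platform JAX will run on by default, e.g. "cpu" or "gpu".
    return jax.default_backend()


backend = jax_backend()
if backend is None:
    print("JAX is not installed in this environment.")
elif backend == "cpu":
    print("JAX is installed but is using the CPU backend.")
else:
    print(f"JAX default backend: {backend}")
```

If the CPU backend is reported on a machine with a GPU, the cuda/cudnn installation or the installed jax distribution is the likely place to look.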

R environment

The 'renv' package is used to manage the R packages. R is only used for data processing in this project. The environment can be set up in multiple ways. The first is by entering R and following the prompts to install the necessary packages. Another option is to install 'renv' and run its restore command, as shown below in the R console.

install.packages("renv")
renv::restore()

The same can be accomplished with the following make command.

make renv

Confirm installation

Installation of the Python virtual environment can be confirmed by running the 'speclet' test suite.

conda activate speclet
pytest
# Alternatively
make test  # or make test_o2 if on O2 HPC

Pre-commit

If you plan to work on the code in this project, I recommend installing 'pre-commit' so that all git commits are first checked for various style and code features. The package is included in the speclet virtual environment, so you just need to run the following command once.

pre-commit install

Configuration

Project configuration YAML

There are options for configuration in the "project-config.yaml" file. There are controls for various constants and parameters for analyses and pipelines. Most are intuitively named.

Environment variables

There is a required ".env" file that should be configured as follows.

PROJECT_ROOT=${PWD}                                 # location of the root directory
PROJECT_CONFIG=${PROJECT_ROOT}/project-config.yaml  # location of project config file

An optional global environment variable used by 'speclet' is AESARA_GCC_FLAG, which sets any desired Aesara gcc/g++ flags in the pipelines. I need to have it set so that Aesara uses the correct gcc and blas modules when running in pipelines on O2 (see issue #151 for details).
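
To make the variable expansion in the ".env" file concrete, here is a minimal standard-library sketch of how the two required values resolve. It is an illustration only: this README does not say how 'speclet' actually loads the file, the function name read_env_file is hypothetical, and "/path/to/speclet" is a stand-in for the real working directory.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from string import Template


def read_env_file(path: Path, base: dict[str, str] | None = None) -> dict[str, str]:
    """Parse KEY=VALUE lines, expanding ${VAR} references.

    References are resolved against `base` (defaults to os.environ) and
    against keys defined earlier in the same file.
    """
    known = dict(os.environ) if base is None else dict(base)
    parsed: dict[str, str] = {}
    for raw_line in path.read_text().splitlines():
        # Simplification: everything after '#' is treated as a comment.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # safe_substitute leaves unknown ${VAR} references untouched.
        value = Template(value.strip()).safe_substitute(known)
        parsed[key] = value
        known[key] = value
    return parsed


# Demonstration with the two required variables and a stand-in for $PWD.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "PROJECT_ROOT=${PWD}  # location of the root directory\n"
        "PROJECT_CONFIG=${PROJECT_ROOT}/project-config.yaml\n"
    )
    env = read_env_file(env_path, base={"PWD": "/path/to/speclet"})

print(env["PROJECT_ROOT"])    # /path/to/speclet
print(env["PROJECT_CONFIG"])  # /path/to/speclet/project-config.yaml
```

Because PROJECT_CONFIG is defined in terms of PROJECT_ROOT, the order of the two lines in the file matters in this sketch.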

Project organization

Data preparation

The data is downloaded to the "data/" directory and prepared in the "munge/" directory. The prepared data is available in "modeling_data/". Please see the READMEs in the respective directories for more information.

All of the data can be downloaded and prepared using the following commands.

make download_data
make munge # or `make munge_o2` if on O2 HPC

Notebooks

Exploration and analyses are conducted in the "notebooks/" directory. Subdirectories divide related notebooks. See the README in that directory for further details.

Python Module

All shared Python code is contained in the "speclet/" directory. The installation of this directory as an editable module should be done automatically when the conda environment is created. If this failed, the module can be installed using the following command.

# Run only if the module was not automatically installed by conda.
pip install -e .
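
One quick way to tell whether the editable install succeeded is to ask Python whether it can locate the module. This is a small sketch using only the standard library; the helper name is_importable is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("speclet"):
    print("'speclet' is importable.")
else:
    print("'speclet' was not found; try running `pip install -e .`.")
```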

The modules are tested using 'pytest' – see below for how to run the tests. They also conform to the 'black' and 'isort' formatters and make heavy use of Python's type-hinting system checked by 'mypy'. The functions are well documented using the Google documentation style and are checked by 'pydocstyle'.

Pipelines

All pipelines and associated files (e.g. configurations and runners) are in the "pipelines/" directory. Each pipeline contains an associated bash script and make command that can be used to run the pipeline (usually on O2). See the README in the "pipelines/" directory for more information.

Reports

Standardized reports are available in the "reports/" directory. Each analysis pipeline has a corresponding subdirectory in the reports directory. These notebooks are meant as quick, standardized reports to check on the results of a pipeline. More detailed analyses are in the "notebooks/" section.

Presentations

Presentations that involved this project are stored in the "presentations/" directory. More information is available in the README in that directory.

Testing

Tests in the "tests/" directory have been written against the modules in "speclet/" using 'pytest' and 'hypothesis'. They can be run using the following command.

# Run full test suite.
pytest
# Or run the tests in two groups simultaneously.
make test  # `test_o2` on O2 HPC

The coverage report can be shown by adding the --cov="speclet" flag. Some tests are slow because they involve the creation of models or sampling/fitting them. These can be skipped using the -m "not slow" flag. Some tests require the ability to construct plots (using the 'matplotlib' library), but not all platforms (notably the HMS research computing cluster) provide this ability. These tests can be skipped using the -m "not plots" flag.
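
The flags above can be combined: the two marker filters are joined into a single -m expression with "and". The sketch below only builds and prints the resulting command line; the helper name pytest_args is hypothetical and not part of 'speclet'.

```python
from __future__ import annotations

import shlex


def pytest_args(
    skip_slow: bool = False, skip_plots: bool = False, coverage: bool = False
) -> list[str]:
    """Build pytest arguments from the options described above."""
    args: list[str] = []
    # Marker filters are joined into one expression for a single -m option.
    markers: list[str] = []
    if skip_slow:
        markers.append("not slow")
    if skip_plots:
        markers.append("not plots")
    if markers:
        args += ["-m", " and ".join(markers)]
    if coverage:
        args.append("--cov=speclet")
    return args


# Example: skip slow and plotting tests, and report coverage.
print(shlex.join(["pytest", *pytest_args(skip_slow=True, skip_plots=True, coverage=True)]))
# prints: pytest -m 'not slow and not plots' --cov=speclet
```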

These tests are run automatically by GitHub Actions on pushes and pull requests to the master branch. The most recent results can be seen here.

Running analyses

Pipelines

Each individual pipeline can be run through a bash script or a make command. See the pipelines README for full details.

Notebooks

The notebooks contain the analyses of the models and additional exploration of the data and other model designs. See the "notebooks/" directory for information on running these analyses.

Full project build

The entire project can be installed from scratch and all analyses run with the following make command.

make build  # or `build_o2` on the O2 HPC