chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

STRONG - Strain Resolution ON Graphs

Overview

STRONG resolves strains on assembly graphs by resolving variants on core COGs using co-occurrence across multiple samples.

Table of Contents

Installation
Quick Start
Usage
Config File
Detailed Pipeline
Synthetic community data

Installation

Prerequisites

The following software should be installed on your machine before attempting to install STRONG:

  • conda (miniconda)
  • cmake, zlib, GNU readline, G++

On a standard Ubuntu 16.04 distribution, the above packages can be installed with:

    sudo apt-get update
    sudo apt-get -y install libbz2-dev libreadline-dev cmake g++ zlib1g zlib1g-dev

We then need to install miniconda; we recommend the Python 3.8 version. To install miniconda, follow the official miniconda installation instructions. Remember that conda activation may require logging out and back in again.
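
Before going further it can help to confirm that the prerequisites are actually on the PATH. The following is a minimal sketch, not part of STRONG: the function name missing_tools is hypothetical, and it only checks that each command can be found, not that its version is suitable.

```shell
# Print each named command that cannot be found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example (commands taken from the prerequisites above):
# missing_tools conda cmake g++
```

An empty output means every listed command was found. Libraries such as zlib and GNU readline are not commands, so they are not covered by this check.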

Conda installation

STRONG can be installed anywhere, but in the steps below we assume it will be placed in a location SPATH that you set as an environment variable:

export SPATH=/mypath/to/repos
cd $SPATH

We begin by cloning STRONG recursively:

git clone --recurse-submodules https://github.com/chrisquince/STRONG.git

STRONG contains DESMAN and BayesPaths as submodules.

If you need to update the submodules in future:

cd STRONG
git submodule foreach git pull origin master
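
A common problem after cloning is that the submodule directories exist but are empty, for example when --recurse-submodules was omitted. As a rough sketch, assuming the submodules live in directories named DESMAN and BayesPaths inside the STRONG checkout (as the later install steps suggest), a hypothetical helper check_populated could verify this:

```shell
# Return 0 only if every named subdirectory of REPO exists and is non-empty.
# Usage: check_populated REPO SUBDIR...
check_populated() {
    repo=$1
    shift
    rc=0
    for sub in "$@"; do
        if [ -z "$(ls -A "$repo/$sub" 2>/dev/null)" ]; then
            echo "empty or missing: $repo/$sub" >&2
            rc=1
        fi
    done
    return $rc
}

# Example:
# check_populated "$SPATH/STRONG" DESMAN BayesPaths
```

If a directory is reported as empty, running git submodule update --init --recursive from the STRONG directory should populate it.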

Automatic installation

All the steps described below have been compiled for convenience in the install_STRONG.sh script. It is mostly silent, and all logs are written to install.log. However, this script does not install any databases; for those, please refer to the corresponding section, Database needed (COG).

Inside the STRONG directory, type the following command:

./install_STRONG.sh 

SPAdes/DESMAN/BayesPaths manual installation

We recommend that you first compile the SPAdes and COG tools executables outside of conda:

cd ./STRONG/SPAdes/assembler

./build_cog_tools.sh 

The full list of requirements is given in the file conda_env.yaml. We recommend mamba for the install; it can itself be installed through conda with:

conda install -c conda-forge mamba

Then we use mamba to resolve the STRONG environment from within the STRONG home directory:

cd $SPATH/STRONG

mamba env create -f conda_env.yaml

This should take 5 - 10 minutes with mamba.

Once the STRONG environment has been installed, activate it with the following command:

conda activate STRONG
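
The remaining install steps must run inside the STRONG environment, otherwise the packages end up in the wrong place. As a small guard, here is a sketch that relies on the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment; the function name require_conda_env is our own invention.

```shell
# Succeed only when the active conda environment has the expected name.
# conda sets CONDA_DEFAULT_ENV when an environment is activated.
require_conda_env() {
    expected=$1
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "expected conda env '$expected', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}

# Example: stop early if the wrong environment is active.
# require_conda_env STRONG || exit 1
```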

It is also necessary to install the BayesPaths executable within the STRONG conda environment:

cd BayesPaths
python ./setup.py install

And also DESMAN:

cd ../DESMAN
python ./setup.py install

BayesPaths uses precompiled executables in the runfg_source directory. These are only compatible with Linux x86-64; on other platforms they will require compilation from source. See the BayesPaths repo for details.

Fix conda install

  1. Fix concoct refine

Unfortunately there is a bug in the conda CONCOCT package caused by updates to Pandas. This needs to be fixed before running the pipeline:

CPATH=`which concoct_refine`
sed -i 's/values/to_numpy/g' $CPATH
sed -i 's/as_matrix/to_numpy/g' $CPATH
sed -i 's/int(NK), args.seed, args.threads)/ int(NK), args.seed, args.threads, 500)/g' $CPATH

  2. Fix R lapack library location

There is a bug in the current conda install of R: the lapack library is present, but not in the location that all the required libraries expect. It is easily fixed with a symbolic link:

ln -s $CONDA_PREFIX/lib/R/modules/lapack.so $CONDA_PREFIX/lib/R/modules/libRlapack.so
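
Note that ln -s fails if the link already exists, which can happen when the install steps are repeated. A hedged sketch of a repeatable version follows; the function link_rlapack is hypothetical, and the paths are the ones used in the command above.

```shell
# Create libRlapack.so as a symlink to lapack.so under PREFIX/lib/R/modules.
# Does nothing if libRlapack.so already exists; fails if lapack.so is absent.
link_rlapack() {
    modules="$1/lib/R/modules"
    if [ ! -e "$modules/lapack.so" ]; then
        echo "lapack.so not found in $modules" >&2
        return 1
    fi
    if [ -e "$modules/libRlapack.so" ] || [ -L "$modules/libRlapack.so" ]; then
        return 0
    fi
    ln -s "$modules/lapack.so" "$modules/libRlapack.so"
}

# Example (with the STRONG environment active):
# link_rlapack "$CONDA_PREFIX"
```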

Database needed (COG)

We will also need a version of the COG database installed. We make this available for download and it can be placed anywhere. Here we point the DB_PATH variable to its location which should be chosen appropriately:

export DB_PATH=/path/to_my/database
mkdir -p $DB_PATH
cd $DB_PATH
wget https://microbial-metag-strong.s3.climb.ac.uk/rpsblast_cog_db.tar.gz
tar -xvzf rpsblast_cog_db.tar.gz
rm rpsblast_cog_db.tar.gz
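
To confirm the archive unpacked where the config file will later look for it, one can test for database files in the extracted directory. This is only a sketch: it assumes the tarball extracts to a directory rpsblast_cog_db containing files whose names start with Cog, which is inferred from the cog_database path used in the Quick start section. The helper name has_db_files is hypothetical.

```shell
# Succeed if at least one file whose name starts with PREFIX exists in DIR.
# Usage: has_db_files DIR PREFIX
has_db_files() {
    dir=$1
    prefix=$2
    for f in "$dir/$prefix"*; do
        if [ -e "$f" ]; then
            return 0
        fi
    done
    echo "no files matching $dir/$prefix*" >&2
    return 1
}

# Example:
# has_db_files "$DB_PATH/rpsblast_cog_db" Cog
```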

Optional Database (GTDB)

GTDB is optionally used in the last part of the pipeline for MAG classification. If a GTDB path is given in the config file, STRONG will naively check for its presence and download it if it is absent. We recommend preinstalling it, since the download may take a while:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz
tar xvzf gtdbtk_r95_data.tar.gz
rm -r db
mv release95 db

Check install

Some issues may crop up with R libraries and/or forgotten installation steps. These can be checked for by running SnakeNest/scripts/check_on_dependencies.py.

Native installation (Not supported yet)

STRONG requires a lot of software; at the moment we recommend using the conda recipe above.

Quick start

First we will download a fairly simple synthetic test data set from known microbial strains into another directory /mypath/torunthings/STRONG_Runs that we will use for STRONG output:

export SRPATH=/mypath/torunthings/STRONG_Runs
mkdir $SRPATH
cd  $SRPATH
wget https://microbial-metag-strong.s3.climb.ac.uk/Test.tar.gz
tar -xvzf Test.tar.gz
rm Test.tar.gz

We are now ready to run STRONG from within the STRONG directory. Two example yamls are provided in the SnakeNest directory. For a high quality run of real data start from config.yaml, but for this simple example use test_config.yaml, which assumes a maximum of 5 strains per MAG as explained below. This file will need to be edited first. The following edits are necessary:

  1. The data directory needs to point at the samples to be assembled; in this case edit:

    data: /mypath/torunthings/STRONG_Runs/Test
  2. The cog_database field to:

    cog_database: /path/to_my/database/rpsblast_cog_db/Cog
  3. The evaluation genomes field, which contains the known genomes to validate against:

    genomes: /mypath/torunthings/STRONG_Runs/Test/Eval

    For real data this step would be deactivated by setting 'execution: 0'.
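
The three edits above can also be scripted rather than made by hand. The following is a sketch under stated assumptions: each key (data, cog_database, genomes) appears exactly once in the file on a line of the form "key: value", and the new value contains no '|', '&' or backslash characters. The function set_config_value and the file name my_config.yaml are hypothetical and not part of STRONG.

```shell
# Replace the value on the "KEY: value" line of FILE, keeping its indentation.
# Usage: set_config_value FILE KEY VALUE
set_config_value() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Match optional leading whitespace, then the exact key followed by a colon.
    sed -E "s|^([[:space:]]*)${key}:.*|\1${key}: ${value}|" "$file" > "$tmp" || return 1
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example, working on a copy of the provided test config:
# cp SnakeNest/test_config.yaml my_config.yaml
# set_config_value my_config.yaml data "$SRPATH/Test"
# set_config_value my_config.yaml cog_database "$DB_PATH/rpsblast_cog_db/Cog"
# set_config_value my_config.yaml genomes "$SRPATH/Test/Eval"
```

Because the pattern is anchored to the start of the line, setting data does not touch the cog_database line. It is still worth opening the edited file afterwards to check the result.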