cerebis / meta-sweeper

Parametric sweep of simulated microbial communities and metagenomic sequencing.
GNU General Public License v3.0
10 stars 0 forks source link

Meta-Sweeper

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Installation
  4. Setup
  5. Workflow Invocation
  6. Sweep Definition
  7. Implemented Workflows
  8. Included Tools

Introduction

Meta-Sweeper is a tool for conducting parametric (combinatorial) sweeps of selected experimental variables relevant to metagenomics research. In particular, we have implemented workflows which enable the generation of simulated microbial communities and their simulated sequencing (WGS and HiC/3C). Downstream of simulation, specific topics of analysis in HiC/3C clustering and Non-negative matrix factorization have been provided. Built upon the Nextflow framework, Meta-Sweeper is general enough that there is no reason that alternative workflows or extensions to existing workflows could not be implemented.

Meta-Sweeper is the culmination and refinement of a central process employed in our ongoing work exploring how to obtain the maximum value when integrating HiC/3C sequencing data with conventional whole-genome shotgun DNA sequencing. To accomplish this, we conducted a parametric sweep of select experimental and model parameters such as sequencing depth, community composition and evolutionary divergence. At each sampled point in the sweep, we pitted a set of algorithms against each-other, where their aim was (in a sense) the deconvolution of the simulated community. This work culminated in the publication: Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C).

Parametric sweeps, and the sampling of parameter spaces in general, have reoccurring applicability within bioinformatics beyond our own topic of research. We believe that this systematic approach is potentially useful to a broad range of research topics, offering a means for more thorough quality assurance and performance testing. Aside from explicit testing, the process of sampling a parameter space can also be used simply as an exploratory technique. In all cases, a reproducible approach is preferable to one-off ad-hoc approaches that can prove difficult to reproduce even for the original authors.

Some suggested applications are such things as: algorithm development, the exploration of experimental requirements for new techniques or simply the comparative assessment of existing tools.

Although numerous workflow management systems exist, we feel there is a lack of facility for parametric sweeps. Further, ease of deployment to various high-throughput computing environments is a crucial feature, when bioinformatics tasks are often computationally expensive in terms of CPU and/or memory resources.

To that end, we invested the time to resolve our processes into a more coherent and easily deployed system. We hope this will prove useful to others.

Prerequisites

Installation

Nextflow

The Nextflow framework, which can be installed easily using either of the following:

wget -qO- get.nextflow.io | bash or curl -fsSL get.nextflow.io | bash

This assumes you have met the prerequisite of installing Java 8+. Note, you will need to have wget or curl installed depending on your choice above. Whether either is installed by default is dependent on which distribution of Linux you are using.

In addition, meta-sweeper expects that the main executable nextflow is accessible on the path. Users can move this file to a location already on the path or add its parent directory to the path.

Cloning MetaSweeper repository

We are employing submodules, therefore when cloning the repository it is easiest to add the recurisve option.

git clone --recursive https://github.com/cerebis/meta-sweeper.git

Python modules

The workflows depend on the following Python modules, which must be installed prior to using it:

Some of these dependencies can be installed using distributional package manager, but not all will be found.

On ubuntu

sudo apt-get install python-biopython python-pandas python-yaml python-networkx python-pysam

On Redhat

sudo yum install python-biopython python-pandas python-yaml python-networkx python-pysam

A more general and complete solution is to use pip.

Using '--upgrade' will ensure that the current version of each module is installed. Though we have not encountered this problem, a note of caution. Upgrading Python modules through Pip, and thus outside of your system's package manager, does incur the potential risk of version conflicts if your installed system packages depend on obsolete or otherwise deprecated functionality with a given Python module. If in doubt, you could try the same command below but omit the '--upgrade' option.

Installing requirements with pip

pip install --upgrade -r requirements.txt

Setup

Configuration Steps

Before running a meta-sweeper workflow, you must initialise the shell environment.

  1. Set meta-sweepers home to your installed location.
    • If meta-sweeper was installed in /home/user/meta-sweeper.
    • export METASWEEPER_HOME=/home/user/meta-sweeper
  2. Append the meta-sweeper home patht to nextflow's classpath.
    • export NXF_CLASSPATH=$NXF_CLASSPATH:$METASWEEPER_HOME
  3. Check that nextflow is installed and has been added to the path.
    • Either of the following should return the canonical path to the nextflow executable.
    • command -v nextflow or which nextflow
  4. A further dependency on beagle-lib exists for the timeseries workflow.

    To use this workflow, users must set the environmental variable BEAGLE_LIBDIR to point to the directory containing libhmsbeagle-jni.so shared library file.

    E.g. BEAGLE_LIBDIR=/usr/lib

    For some Linux distributions, beagle-lib can be satisfied through the system package manager. In other cases, users will need to download and install beagle-lib manaully. Please be certain to their documentation and make sure that all listed prerequisites described therein are met prior to attempting compilation.

Automated Setup

We have provided a Bash script which attempts to automate and verify the setup process. Please Note: as we are initialising environmental variables for your current shell, you must source this script (. bash_configure OR source bash_configure). Do not instead make it executable, as running it (./bash_configure) will not result in changes to your shell's environment.

Sourcing setup script

. bash_configure

Users should pay attention to the output from the script. We attempt to highlight problems in red.

Workflow Invocation

After initialisation, users may start sweeps using standard Nextflow invocation syntax or simply execute the chosen workflow scrip itself. Users will most likely wish to modify sweep to suit their own purposes, see below for an explanation of how a sweep is defined.

Running a workflow

Any workflow can be started either using the following syntax:

nextflow run [options] <workflow>

For instance, running Metagenomic-Hic stage 1 can be like so:

nextflow run chrcc-sweep.nf

or even more simply, by treating any workflows as an executable:

./chrcc-sweep.nf

It is worth mentioning here that not all of our workflow scripts below are completely independent. For instance, Metagenomic Chromosome Conformation Capture is broken into three stages, with the second and third stages depending on the results of the previous stage. Therefore users must begin with stage 1 in this case.

Note: the nature of mixing concurrency and potentially resource hungry processes (such as genome assembly) can mean that a simple local execution strategy may result in resource starvation and subsequently premature program termination. It is recommended, in the long run, that users make use of a Nextflow supported distributed resource manager (DRM) such as SGE, SLURM, etc.

When a DRM is available as a target of execution, workflows can be easily submitted with only minimal configurational change to execute on potentially many physical nodes. Most frequently, this entails changes to the queue name for the target system. Even with only a single physical machine however, a well configured DRM can take better responsibility for managing available compute resources in completing a workflow's set of tasks.

Common options

Four options which users may frequently employ are (note: single hyphen) as follows.

-profile [name]

Shorthand for including additional configuration details at runtime (perhaps when using different execution environments) and predefined in a nextflow configuration file under a given label [name]. For simplicity, these can be placed in the default nextflow.config and we have provided a few examples for a few different execution targets described below.

-resume

For many reasons, long running workflows may be interrupted either inadvertently or intentionally. For convenience and potentially a large time-saver, it is possible to restart or resume such jobs.

-queue-size [Int]

Limit concurrency to an integer number [Int] of simultaneous running tasks.

-work-dir

Specify a working directory other than the default work. This can he useful when separately running multiple workflows and you wish to inspect the contents of working directories without having to also determine which sub-folders pertain to which workflow.

Further command-line options

Nextflow will describe its' base interface by invoking it alone on the command-line:

nextflow

Detailed help with each command can be accessed as follows:

nextflow help [command]

Execution Targets

For even a relatively shallow parametric sweep, the sum total of computational resources required can be significant. Scheduling systems (SGE, PBS, SLURM, etc) are ideally suited to managing this task and Nextflow makes directing execution to them easy.

Specific scheduler submission details vary, as can the resources required for individual tasks. This information (executor, queue name, cpu, memory, disk) can be encapsulated either broadly in a config profile or per-process using process directives.

We have provided a few simple examples of such profiles within the default nextflow configuration file nextflow.config, which is automatically sourced by Nextflow at invocation.

Predefined Execution Profiles

We have attempted to use the simplest DRM-appropriate defaults, but queue names in particular frequently vary between deployments. Additionally, it is possible to specify many more criteria for your particular queuing system.

profiles {

        standard {
                process.executor = 'local'
        }

        sge {
                process.executor = 'sge'
                queue = 'all.q'
                clusterOptions = '-S /bin/bash'
        }

        pbs {
                process.executor = 'pbs'
                queue = 'batch'
                clusterOptions = '-S /bin/bash'
        }

        pbspro {
                process.executor = 'pbspro'
                queue = 'workq'
                clusterOptions = '-S /bin/bash'
        }
}

Submission Examples

Submit to SGE queue manager

./chrcc-sweep.nf -profile sge

Submit to a PBS queue manager

./chrcc-sweep.nf -profile pbs

Trouble Shooting

--debug

All of our implemented workflows below accept the double-hyphen runtime option --debug.

Including this option at invocation time will test your environmental setup and the workflow processing logic but will not execute any actual bioinformatics tools. Workflows will execute very quickly in this mode and, if all is well, will complete without emitting an error. The mode creates only mock output files to meet inter-process dependencies. This mode is helpful when first configuring your system, particularly when configuring submission to a queue manager.

Nextflow Log

Nextflow writes to a hidden log file (.nextflow.log), which it manages in a rotation akin to Linux logrotate. If an error occurs during a workflow, it is very likely that Nextflow will expose it as a message to stdout, prompting the user to inspect the log file. In most cases, Nextflow will provide details on which task failed and its working directory. This directory should be the first port of call when troubleshooting a new problem. We would recommend inspecting both command.err (.command.err) and command.out (.command.out) to see if what went wrong can be ascertained. If not, inspect command.sh (.command.sh) and verify that the issued command is properly formed.

It is our experience that runtime errors are often caused by unset environmental variables. This can be as simple as inadvertantly switching shell terminals, where the second shell has not been configured. Therefore, even when you think you have done so already, please double check that you have initialized the environment as outlined above.

Some Error Sources

  1. Unset environmental variables

  2. Incorrect runtime architecture on execution host (requires x86_64)

  3. Incorrect Java JVM. Please try using Java 8 or later.

  4. Out of storage space. Sweeps can occupy significant space.

  5. Exceeding CPU or other resource limits on execution host.

  6. Missing Python dependencies for invoked interpreter.

    If using a Python interpreter other than the system default, make sure that the alternate's path (/home/foobar/bin) preceeds that of the system's default (/usr/bin).

    e.g. PATH=/home/foobar/bin:/usr/bin

  7. No internet access.

    If necessary, Meta-Sweeper uses Groovy Grape to automatically satisfy a few dependencies from internet repositories. Without access, compliation of our workflows will fail (see below).

Script Compilation Errors.

In more severe cases, the Nextflow script might itself have failed to compile (groovy scripts are built at invocation time). In this situation, errors can be harder to decrypt and we would encourage users to contact us.

Sweep Definition

Primarily, a parametric sweep is defined by a set of variables which are sampled over a predefined range and granularity. In addition, there can be many other values held constant throughout the sweep, which are otherwise necessary and relevant to declare.

Effectively, the sweep is a sample-space where it is up to the user to decide on how finely the space is explored, whether the change between points is linear or otherwise. Once decided, the user declares the set of values for each variable in a simple configuration file.

Within the sweep configuration file, all varied parameters are grouped under the 'variables' label, while the remaining fixed values are collected under the 'options' label.

For the workflows implemented below, we have opted for a separate file for each worflow. Although not be strictly necessary, it permits keeping the size and complexity of the configuration file to a minimum, while more importantly avoiding unintended side-effects when changes with regard to one workflow are inadvertently taken up in another.

To tailor a sweep your preferences, users simply to extend or reduce the number of values taken on by any parameter. Note that at least one value must be defined for any variable used in a workflow. By defining a single value, a parameter is effectively fixed in the sweep. Ambition can quickly get the best of you and users should keep in mind that fine sampling of even a few parameters can lead to an exponential explosion in the full parameter space, potentially outstripping the computational resources at hand.

For any workflow, replicates are easily generated by defining multiple seeds.

Sweep Variables

How parameters vary in a sweep are defined in the configuration file. For the implemented workflows, there are a few more common parameters which we will now mention.


Figure 1: ANIb vs Alpha
Figure 1. Average nucleotide identity as a function of scale factor αBL

Configuration File Example

Taken from Metagenomic Chromosome Conformation Capture workflow, the following is an example of a sweep configuration file in YAML syntax.

variables:
  # Level 0
  # A complete sweep is run for each random seed defined here. In this case
  # we will have 3 replicates of the entire sweep.
  seed: [1, 2, 3]
  # Level 1
  # Communities are generated for each seed.
  community: !com
    name: trial
    # A community of two clades. 
    # Clades can have any ancestral/donar sequence, but here they use the same.
    # Altogether there will be 8 taxa in the community.
    clades:
      - !clade
        prefix: clade1
        ancestor: test/ancestor.fa
        donor: test/donor.fa
        ntaxa: 5
        tree: {birth: 0.9, death: 0.5}
        profile: {mu: 0.1, sigma: 1}
      - !clade
        prefix: clade2
        ancestor: test/ancestor.fa
        donor: test/donor.fa
        ntaxa: 3
        tree: {birth: 0.7, death: 0.3}
        profile: {mu: 0.2, sigma: 1.5}
  # Level 2
  # Phylogenetic scale factor [0..1]. 
  # The smaller this value, the more closely related
  # taxa in each clade become.
  alpha: [1, 0.5, 0.1]
  # Level 3
  # WGS sequencing depth, measured in times coverage
  xfold: [1, 5, 10]
  # Level 4
  # The number of CCC read-pairs to generate 
  num_3c: [50000, 100000, 1000000]

options:
  # How many samples. For most workflows, this is set to 1.
  num_samples: 1
  # Probabilties supplied to sgEvolver when generating community sequences.
  evo:
    indel_freq: 1e-4
    small_ht_freq: 1e-4
    large_ht_freq: 1e-4
    inversion_freq: 1e-4
  # WGS sequencing.
  wgs:
    read_len: 150
    insert_mean: 450
    insert_sd: 100    
  # 3C/HiC sequencing 
  ccc:
    method: hic
    enzyme: NlaIII
    insert_mean: 500
    insert_sd: 100
    insert_max: 1000
    read_len: 150
    machine_profile: EmpMiSeq250
    create_cids: true
  # The output directory
  output: out

Output File Naming

At each point in the sweep, files which are considered outputs at various steps within a workflow are copied to the output folder [default: out]. For each such output file, the file name is used to pass-on information about the the parameters used at that given point in the sweep. This information is encoded as a structured string using two delimiters. These delimiters were chosen to avoid conflicts with system constraints, avoid unicode and be at least somewhat human readable.

This string is then prepended to the respective file name, which afterwards follows the regular convention of encoding information about the type of file.

Name/Value encoding:

Per-parameter syntax is delimited by a # character as follows:

[parameter_name]#[parameter_value]

Chaining parameters:

Multiple parameters are joined by the three character string -+-:

[parameter_name1]#[parameter_value1]-+-[parameter_name2]#[parameter_value2]

Example:

As an example, the following would be the file names for WGS reads R1/R2 pertaining to the sweep point seed=1, αBL=0.5, xfold=10:

 seed#1-+-alpha#0.5-+-xfold#10.wgs.r1.fq.gz
 seed#1-+-alpha#0.5-+-xfold#10.wgs.r2.fq.gz

Implemented Workflows

In an effort to highlight the utility of Meta-Sweeper, we have implemented three workflows covering different metagenomic analysis topics we feel applicable to study by parametric sweep. In each case, the workflow definition can be adjusted through the respective configuration file, tailoring it to user requirements.

We encourage users to modify these examples for their own purposes.

1. Metagenomic Chromosome Conformation Capture

This topic was our original motivation for creating Meta-Sweeper. The work culminated in the publication Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C), where Meta-Sweeper represents a refinement of the methods employed in that work, allowing for its straightforward reproduction.

An added flexibility within this workflow is the ability to choose between the simulation of HiC or meta3C library preparation protocols. The most significant difference between the methods is that HiC employs a biotin affinity pull-down to greatly enrich for ligation products, while meta3C makes no attempt to do so. The outcome is that reads sequenced from a meta3C library will contain many conventional whole-genome shotgun (WGS) reads (>95%). Therefore, when simulating a meta3C workflow, we recommend users increase the requested sequencing yield (Eg. 10 * num_3c). At present, when running a meta3C simulation, the conventional purely WGS read-set is still created. In many real experiments, users might elect to produce only a single "meta3C enchanced" read-set, passing it to assembly as a plain WGS read-set.

The sweep is varied over five levels:

  1. Random seed
  2. Community structure
  3. Community divergence (αBL)
  4. WGS coverage (xfold)
  5. HiC ligation products (num_3c)

Sweep file: chrcc.yaml

The complete workflow has been broken into three stages, which we feel are often of separate interest.

Stage 1: Data Generation

Stage 2: Clustering

Stage 3: Aggregation

Results File Definition

The serialized YAML object contains the following entries:

Example: two sweep points from all_stats.yaml

- params: {seed: '2', alpha: '1', xfold: '1', n3c: '5000', algo: louvsoft}
  asmstat: {L50: 12, N50: 455}
  bc: {completeness: 0.964, f: 0.141, pre: 1.0, rec: 0.0757}
  geigh: {method: eigh, value: null}
  gstat: {density: 0.0741, inclusives: 28, isolates: 0, mean_deg: 2.0, median_deg: 2, modularity: 0.964, order: 28, size: 28}
- params: {seed: '3', alpha: '1', xfold: '2', n3c: '5000', algo: louvsoft}
  asmstat: {L50: 130, N50: 568}
  bc: {completeness: 0.953, f: 0.0152, pre: 1.0, rec: 0.008}
  geigh: {method: eigh, value: -0.0}
  gstat: {density: 0.006, inclusives: 342, isolates: 0, mean_deg: 2.006, median_deg: 2, modularity: 0.997, order: 342, size: 343}

which when converted to tabular format would look like:

,params-xfold,params-alpha,params-n3c,params-seed,params-algo,asmstat-L50,asmstat-N50,gstat-mean_deg,gstat-density,gstat-modularity,gstat-median_deg,gstat-inclusives,gstat-isolates,gstat-order,gstat-size,geigh-method,geigh-value,bc-pre,bc-rec,bc-completeness,bc-f
0,1,1,5000,2,louvsoft,12,455,2.0,0.0741,0.964,2,28,0,28,28,eigh,,1.0,0.0757,0.964,0.141
1,2,1,5000,3,louvsoft,130,568,2.006,0.006,0.997,2,342,0,342,343,eigh,-0.0,1.0,0.008,0.953,0.0152

2. Time-series Deconvolution

A test case for metagenome deconvolution by non-negative matrix factorisation of a time-series data-set.

This workflow is implemented as a single script. Internally, the stages of execution involve:

  1. Community creation
  2. WGS read simulation
  3. WGS read mapping
  4. NNMF based deconvolution

The sweep is varied over four levels:

  1. Random seed
  2. Community structure
  3. Community divergence (αBL)
  4. WGS coverage (xfold)

This workflow does not involve HiC sequencing data.

Script: timeseries-deconvolute.nf

Sweep file: timeseries.yaml

The final outcome of the deconvolution process for each sweep point can be found in the files following the naming strategy: ${key}.truth.report.txt. Here, the variable ${key} represents the structured string encoding information about the parameters used at a given point in the sweep. This is explained above

E.g. For seed=1, αBL=0.5 and xfold=10, the report file would be named:

 seed#1-+-alpha#0.5-+-xfold#10.truth.report.txt

Included Tools

Read Simulation

Clustering Algorithms

Darling Lab

i3 institute, UTS Sydney