
OPAL: Open-community Profiling Assessment tooL
https://cami-challenge.github.io/OPAL/
Apache License 2.0


Taxonomic metagenome profilers predict the presence and relative abundance of microorganisms from shotgun sequence samples of DNA isolated directly from a microbial community. Over the past years, there has been an explosive growth of software and algorithms for this task, resulting in a need for more systematic comparisons of these methods based on relevant performance criteria. OPAL implements commonly used performance metrics, including those of the first challenge of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI), together with convenient visualizations.

Computed metrics

Example pages produced by OPAL

See also

User Guide

Installation

Requirements

OPAL 1.0.12 has been tested with Python 3.10 and 3.11.

See requirements.txt for all dependencies.

Steps

You can install OPAL using Docker, Bioconda, or as follows.

Install pip if not already installed (tested on Linux Ubuntu 18.04):

sudo apt install python3-pip

If you receive the message "Unable to locate package python3-pip", run the following commands and then repeat the previous step.

sudo add-apt-repository universe
sudo apt update

Then run:

pip3 install cami-opal

Make sure to add OPAL to your PATH:

echo 'PATH=$PATH:${HOME}/.local/bin' >> ~/.bashrc
source ~/.bashrc
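
To confirm that the installation worked, a check along the following lines may help. It is only a sketch: it assumes that pip placed an executable named opal.py in ~/.local/bin, as the PATH step above suggests, and the function names find_opal and user_bin_on_path are made up for this illustration.

```python
import os
import shutil


def find_opal(executable="opal.py"):
    """Return the full path of the OPAL entry point, or None if it is not on PATH."""
    return shutil.which(executable)


def user_bin_on_path(path_value=None):
    """Report whether ~/.local/bin appears in the given PATH string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    user_bin = os.path.join(os.path.expanduser("~"), ".local", "bin")
    return user_bin in path_value.split(os.pathsep)


if __name__ == "__main__":
    location = find_opal()
    if location is None:
        print("opal.py was not found on PATH")
        print("~/.local/bin on PATH:", user_bin_on_path())
    else:
        print("opal.py found at", location)
```

If opal.py is not found although ~/.local/bin is on PATH, the package may have been installed into a different location, for example a virtual environment.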

Inputs

Note: Support for the BIOM format has been temporarily dropped in OPAL 1.0.4 due to incompatibility with Python 3.7.

OPAL uses at least two files:

  1. A gold standard taxonomic profile
  2. One or more taxonomic profiles to be assessed

Files must be in the CAMI profiling Bioboxes format or in the BIOM (Biological Observation Matrix) format. The tsv2biom.py program converts profiles from the former format to the latter.
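
For orientation, the following sketch reads a single-sample profile in the CAMI profiling Bioboxes format. It is not OPAL's own loader. It assumes the layout described in the Bioboxes profiling specification as commonly used: "@Key:value" header lines, one "@@" line naming the tab-separated columns, and one row per taxon. The function name read_cami_profile and the sample content in EXAMPLE are hypothetical.

```python
import io


def read_cami_profile(handle):
    """Parse one sample in the CAMI profiling Bioboxes format.

    Returns (header, rows): header maps lowercased '@Key' names to their
    values, rows is a list of dicts keyed by the column names on the '@@' line.
    """
    header, columns, rows = {}, None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("@@"):
            columns = line[2:].split("\t")
        elif line.startswith("@"):
            key, _, value = line[1:].partition(":")
            header[key.strip().lower()] = value.strip()
        else:
            if columns is None:
                raise ValueError("data row found before the '@@' column line")
            row = dict(zip(columns, line.split("\t")))
            row["PERCENTAGE"] = float(row["PERCENTAGE"])
            rows.append(row)
    return header, rows


# Hypothetical two-rank profile, used only to exercise the parser.
EXAMPLE = (
    "@SampleID:sample_0\n"
    "@Version:0.9.1\n"
    "@Ranks:superkingdom|phylum\n"
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE\n"
    "2\tsuperkingdom\t2\tBacteria\t100.0\n"
    "1224\tphylum\t2|1224\tBacteria|Proteobacteria\t60.0\n"
    "1239\tphylum\t2|1239\tBacteria|Firmicutes\t40.0\n"
)

header, rows = read_cami_profile(io.StringIO(EXAMPLE))
print(header["sampleid"], len(rows))
```

Real profiles may contain several samples in one file, which this sketch does not handle.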

The BIOM format

The BIOM format used by OPAL is a sparse matrix stored in a JSON or HDF5 file, with one column per sample and one row per taxonomy ID, storing the corresponding abundances. RANK, TAXPATH, and TAXPATHSN are stored as metadata of each row and have the same meaning as in the CAMI profiling Bioboxes format.
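
The layout described above can be pictured with a small standard-library sketch. It builds a dictionary that mimics the core fields of a BIOM 1.0 JSON table (rows, columns, shape, and sparse [row, column, value] entries). It is not a complete, valid BIOM file and does not use the biom-format package; the function name to_sparse_table and the input records are assumptions made for illustration.

```python
def to_sparse_table(samples):
    """Arrange per-sample profile rows as a taxon-by-sample sparse matrix.

    samples maps a sample ID to a list of dicts with the keys
    TAXID, RANK, TAXPATH, TAXPATHSN and PERCENTAGE.
    """
    row_index = {}   # taxonomy ID -> row number
    table_rows = []  # one entry per taxonomy ID, with its metadata
    column_ids = list(samples)
    entries = []     # sparse entries: [row, column, abundance]
    for col, sample_id in enumerate(column_ids):
        for record in samples[sample_id]:
            taxid = record["TAXID"]
            if taxid not in row_index:
                row_index[taxid] = len(table_rows)
                table_rows.append({
                    "id": taxid,
                    "metadata": {
                        "RANK": record["RANK"],
                        "TAXPATH": record["TAXPATH"],
                        "TAXPATHSN": record["TAXPATHSN"],
                    },
                })
            if record["PERCENTAGE"] > 0:  # zeros are simply not stored
                entries.append([row_index[taxid], col, record["PERCENTAGE"]])
    return {
        "rows": table_rows,
        "columns": [{"id": sample_id} for sample_id in column_ids],
        "matrix_type": "sparse",
        "shape": [len(table_rows), len(column_ids)],
        "data": entries,
    }


# Hypothetical records for two samples.
bacteria = {"TAXID": "2", "RANK": "superkingdom", "TAXPATH": "2", "TAXPATHSN": "Bacteria"}
archaea = {"TAXID": "2157", "RANK": "superkingdom", "TAXPATH": "2157", "TAXPATHSN": "Archaea"}
table = to_sparse_table({
    "sample_0": [dict(bacteria, PERCENTAGE=100.0)],
    "sample_1": [dict(bacteria, PERCENTAGE=90.0), dict(archaea, PERCENTAGE=10.0)],
})
print(table["shape"], table["data"])
```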

Running opal.py

usage: opal.py -g GOLD_STANDARD_FILE -o OUTPUT_DIR [-n] [-f FILTER] [-p] [-l LABELS] [-t TIME] [-m MEMORY] [-d DESC] [-r RANKS] [--metrics_plot_rel METRICS_PLOT_REL]
               [--metrics_plot_abs METRICS_PLOT_ABS] [--silent] [-v] [-h] [-b BRANCH_LENGTH_FUNCTION] [--normalized_unifrac]
               profiles_files [profiles_files ...]

OPAL: Open-community Profiling Assessment tooL

required arguments:
  profiles_files        Files of profiles
  -g GOLD_STANDARD_FILE, --gold_standard_file GOLD_STANDARD_FILE
                        Gold standard file
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Directory to write the results to

optional arguments:
  -n, --normalize       Normalize samples
  -f FILTER, --filter FILTER
                        Filter out the predictions with the smallest relative abundances summing up to [FILTER]% within a rank
  -p, --plot_abundances
                        Plot abundances in the gold standard (can take some minutes)
  -l LABELS, --labels LABELS
                        Comma-separated profiles names
  -t TIME, --time TIME  Comma-separated runtimes in hours
  -m MEMORY, --memory MEMORY
                        Comma-separated memory usages in gigabytes
  -d DESC, --desc DESC  Description for HTML page
  -r RANKS, --ranks RANKS
                        Highest and lowest taxonomic ranks to consider in performance rankings, comma-separated. Valid ranks: superkingdom, phylum, class, order, family, genus, species,
                        strain (default:superkingdom,species)
  --metrics_plot_rel METRICS_PLOT_REL
                        Metrics for spider plot of relative performances, first character, comma-separated. Valid metrics: w:weighted Unifrac, l:L1 norm, c:completeness, p:purity, f:false
                        positives, t:true positives (default: w,l,c,p,f)
  --metrics_plot_abs METRICS_PLOT_ABS
                        Metrics for spider plot of absolute performances, first character, comma-separated. Valid metrics: c:completeness, p:purity, b:Bray-Curtis (default: c,p)
  --silent              Silent mode
  -v, --version         show program's version number and exit
  -h, --help            Show this help message and exit

UniFrac arguments:
  -b BRANCH_LENGTH_FUNCTION, --branch_length_function BRANCH_LENGTH_FUNCTION
                        UniFrac tree branch length function (default: "lambda x: 1/x", where x=tree depth)
  --normalized_unifrac  Compute normalized version of weighted UniFrac by dividing by the theoretical max unweighted UniFrac
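
The --filter option can be read as trimming the low-abundance tail of the predictions at each rank. The sketch below shows one possible interpretation, not OPAL's actual implementation: at a single rank, taxa are removed in order of increasing abundance for as long as the removed abundances add up to at most FILTER percent of the rank's total. The function name filter_tail and the input values are hypothetical.

```python
def filter_tail(abundances, filter_percent):
    """Drop the least abundant taxa whose abundances sum to at most filter_percent.

    abundances maps taxonomy IDs to abundances at one rank. The threshold is
    taken relative to the total abundance at that rank (an assumption).
    """
    threshold = sum(abundances.values()) * filter_percent / 100.0
    kept = dict(abundances)
    cumulative = 0.0
    # Walk from the smallest abundance upwards and stop at the first taxon
    # that would push the removed mass over the threshold.
    for taxid, value in sorted(abundances.items(), key=lambda item: item[1]):
        cumulative += value
        if cumulative > threshold:
            break
        del kept[taxid]
    return kept


# Hypothetical genus-level predictions, in percent.
predictions = {"a": 50, "b": 30, "c": 15, "d": 4, "e": 1}
print(sorted(filter_tail(predictions, 5)))
```

With these values and a filter of 5, the two smallest predictions (4 and 1) are removed, since together they account for exactly 5 percent.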

Example: To run the example, please download the files given in the data directory.

./opal.py -g data/goldstandard_low_1.bin \
data/cranky_wozniak_13 \
data/grave_wright_13 \
data/furious_elion_13 \
data/focused_archimedes_13 \
data/evil_darwin_13 \
data/agitated_blackwell_7 \
data/jolly_pasteur_3 \
-l "TIPP, Quikr, MP2.0, MetaPhyler, mOTU, CLARK, FOCUS" \
-o output_dir
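
When the same run is launched from a script, the command can be assembled as an argument list. The following is a sketch using only the options shown in the usage text above; the helper name build_opal_command is hypothetical, and the check that the number of labels matches the number of profiles is a precaution added here, not documented OPAL behavior.

```python
def build_opal_command(gold_standard, profiles, output_dir, labels=None,
                       executable="./opal.py"):
    """Assemble an opal.py command line as a list suitable for subprocess.run."""
    if not profiles:
        raise ValueError("at least one profile file is required")
    if labels is not None and len(labels) != len(profiles):
        raise ValueError("expected one label per profile file")
    command = [executable, "-g", gold_standard, *profiles]
    if labels:
        # Joined with ", " to mirror the example invocation above.
        command += ["-l", ", ".join(labels)]
    command += ["-o", output_dir]
    return command


command = build_opal_command(
    "data/goldstandard_low_1.bin",
    ["data/cranky_wozniak_13", "data/grave_wright_13"],
    "output_dir",
    labels=["TIPP", "Quikr"],
)
print(command)
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with labels that contain spaces.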

Running opal.py using Docker

Download or git-clone OPAL from GitHub. In OPAL's directory, build the Docker image with the command:

docker build -t opal:latest .

opal.py can then be run with the docker run command. Example:

docker run -v $(pwd):/host opal \
opal.py -g /host/data/goldstandard_low_1.bin \
/host/data/cranky_wozniak_13 \
/host/data/grave_wright_13 \
/host/data/furious_elion_13 \
/host/data/focused_archimedes_13 \
/host/data/evil_darwin_13 \
/host/data/agitated_blackwell_7 \
/host/data/jolly_pasteur_3 \
-l "TIPP, Quikr, MP2.0, MetaPhyler, mOTU, CLARK, FOCUS" \
-o /host/output_dir
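
Because the current directory is mounted at /host inside the container, every file argument has to be given as a path below /host. A small helper can do that translation; this is a sketch, and the function name to_container_path and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_root, mount_point="/host"):
    """Translate a path below host_root into the matching path under mount_point.

    Raises ValueError if host_path is not located below host_root.
    """
    relative = PurePosixPath(host_path).relative_to(host_root)
    return str(PurePosixPath(mount_point) / relative)


print(to_container_path("/home/user/OPAL/data/cranky_wozniak_13", "/home/user/OPAL"))
```

Files outside the mounted directory are not visible to the container, which is why the helper raises an error for them instead of returning a path.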

Running tsv2biom.py

usage: tsv2biom.py [-h] -o OUTPUT_FILE [-j] files [files ...]

Convert profile in the CAMI Bioboxes format to BIOM

positional arguments:
  files                 Input file(s), one file per sample

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file
  -j, --json            Output in json (default: hdf5)

Example:

python3 tsv2biom.py data/cranky_wozniak_13 -o output_dir/cranky_wozniak_13.biom

Measuring runtime and maximum main memory usage

To measure the runtime and maximum main memory usage of a taxonomic profiler using OPAL, the profiler must be packaged as a Biobox Docker image. Several Bioboxes are already available on Docker Hub (see Examples page).

To build your own Biobox, general instructions are available at http://bioboxes.org/. Most importantly, the Biobox of a profiler must satisfy specific input and output formats (see section Inputs above). Helpful examples of scripts and Dockerfiles are available at https://github.com/CAMI-challenge/docker_profiling_tools.

OPAL's tools to measure runtime and maximum main memory usage are:

See example usage of these tools in the Examples page.

Runtimes and memory usages can also be manually provided to opal.py using options --time and --memory. They will then be incorporated in the results files and the HTML report.
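
If the measurements come from another tool, they first have to be converted to the units that opal.py expects: hours for --time and gigabytes for --memory, each as a comma-separated list. The sketch below assumes raw values in seconds and bytes, one per profile in the same order as the profile files, and takes a gigabyte to be 10^9 bytes. The function name resource_options and the numbers are hypothetical.

```python
def resource_options(runtimes_seconds, peak_memory_bytes):
    """Format measured runtimes and peak memory as --time/--memory arguments."""
    if len(runtimes_seconds) != len(peak_memory_bytes):
        raise ValueError("expected one runtime and one memory value per profile")
    hours = ",".join(f"{seconds / 3600:.2f}" for seconds in runtimes_seconds)
    gigabytes = ",".join(f"{size / 1e9:.2f}" for size in peak_memory_bytes)
    return ["--time", hours, "--memory", gigabytes]


# Hypothetical measurements for two profilers.
options = resource_options([3600, 5400], [2e9, 5e8])
print(options)
```

The returned list can be appended to the other opal.py arguments when the command is built as a list.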

More examples

See Examples page.

Developer Guide

We are using tox for project automation.

Tests

To run the tests, type the following in the project's root directory:

tox

Citation

Please cite:

Part of OPAL's functionality was described in the CAMI manuscript. Thus please also cite:

or