emo-bon / MetaGOflow

An MGnify-based implementation of the Marine Genomic Observatories pipeline, developed in the framework of an EOSC-Life funded project
Apache License 2.0

metaGOflow: A workflow for marine Genomic Observatories' data analysis


An EOSC-Life project

The workflows developed in the framework of this project are based on pipeline-v5 of the MGnify resource.

This branch is a child of the pipeline_5.1 branch that contains all CWL descriptions of the MGnify pipeline version 5.1.


To run metaGOflow, first make sure the following are set up in your computing environment:

Storage while running

Disk requirements vary with the analysis you are about to run. For indicative figures, see the metaGOflow publication, which reports the computing resources used in various cases.


Get the EOSC-Life marine GOs workflow

git clone https://github.com/emo-bon/MetaGOflow
cd MetaGOflow

Download necessary databases (~235GB)

You can download databases for the EOSC-Life GOs workflow by running the download_dbs.sh script under the Installation folder.

bash Installation/download_dbs.sh -f [Output Directory e.g. ref-dbs] 
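Since the download is roughly 235 GB, it may be worth confirming that the target filesystem has enough room before starting. The following is a minimal sketch, not part of the repository: the helper name has_free_space is hypothetical, and it assumes a POSIX df and awk are available.

```shell
# Hypothetical helper (not shipped with metaGOflow):
# succeed if the filesystem holding directory $1 has at least $2 GB free.
has_free_space() {
    dir=$1
    required_gb=$2
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of that line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Check the current directory before downloading the databases into it.
if has_free_space . 235; then
    echo "Enough free space for the databases"
else
    echo "Less than 235 GB free on this filesystem" >&2
fi
```

The 235 figure only covers the databases themselves; intermediate and output files of a run need additional space.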

If one or more of the databases already exist on your system, create a symbolic link that points to the existing ref-dbs folder, or to the relevant subfolders/files, instead of downloading them again.
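As an illustration of the symbolic-link approach, the sketch below links an existing database copy into a ref-dbs folder. The helper name link_db and the path /data/shared/eggnog are hypothetical; substitute the location where the database actually lives on your system.

```shell
# Hypothetical helper (not shipped with metaGOflow):
# link an existing database directory or file ($1) into the folder $2,
# keeping its original name.
link_db() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    # -s creates a symbolic link; -f and -n replace a link of the same
    # name if one already exists, rather than following it.
    ln -sfn "$src" "$dest_dir/$(basename "$src")"
}

# Example call with an assumed location of an existing eggnog copy:
# link_db /data/shared/eggnog ref-dbs
```

Use an absolute path for the source so that the link stays valid regardless of the directory you run the workflow from.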

The final structure of the DB directory should be like the following:

user@server:~/MetaGOflow: ls ref-dbs/
db_kofam/  diamond/  eggnog/  GO-slim/  interproscan-5.57-90.0/  kegg_pathways/  kofam_ko_desc.tsv  Rfam/  silva_lsu/  silva_ssu/
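To compare a local folder against the listing above, a small check such as the following sketch can be used. The function name check_ref_dbs is hypothetical; the entry names are taken from the listing, and the check only tests that each entry exists, not that its contents are complete.

```shell
# Hypothetical helper (not shipped with metaGOflow):
# report any expected entry that is absent from the database folder $1.
check_ref_dbs() {
    db_dir=$1
    missing=0
    for entry in db_kofam diamond eggnog GO-slim interproscan-5.57-90.0 \
                 kegg_pathways kofam_ko_desc.tsv Rfam silva_lsu silva_ssu; do
        # -e follows symbolic links, so linked databases count as present.
        if [ ! -e "$db_dir/$entry" ]; then
            echo "missing: $db_dir/$entry" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_ref_dbs ref-dbs; then
    echo "ref-dbs contains all expected entries"
else
    echo "ref-dbs is incomplete" >&2
fi
```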

How to run

We recommend utilizing Conda to create a virtual environment. We provide a Conda environment file that includes the necessary dependencies.

Set up the environment

Run once - Setup environment

This will create a conda env called metagoflow.

conda env create -f conda_environment.yml

Run every time

conda activate metagoflow

Run the workflow

Using Singularity

./run_wf.sh -s -n osd-short \
-d short-test-case \
-f test_input/wgs-paired-SRR1620013_1.fastq.gz \
-r test_input/wgs-paired-SRR1620013_2.fastq.gz

Using a cluster with a queueing system (e.g. SLURM)

Using Docker

./run_wf.sh -n osd-short -d short-test-case \
-f test_input/wgs-paired-SRR1620013_1.fastq.gz \
-r test_input/wgs-paired-SRR1620013_2.fastq.gz

HINT: If you are using Docker, run the command without the '-s' flag, as in the Docker example above.

Testing samples

The samples are available in the test_input folder.

We provide metaGOflow with partial samples from the Human Metagenome Project (SRR1620013 and SRR1620014). They are partial in that only a small part of their sequences has been kept, so that the pipeline can be tested quickly.

Hints and tips

  1. If you are using Docker, we strongly recommend that you avoid installing it through snap.

  2. RuntimeError: slurm currently does not support shared caching, because it does not support cleaning up a worker after the last job finishes. Set the --disableCaching flag if you want to use this batch system.

  3. If you get an error like:

cwltool.errors.WorkflowException: Singularity is not available for this tool

You may run the following command:

singularity pull --force --name debian:stable-slim.sif docker://debian:stable-slim


To make contributing to the project a bit easier, all the MGnify conditionals and subworkflows under the workflows/ directory that are not used in the metaGOflow framework have been removed.
However, all the MGnify tools/ and utils/ remain available in this repo, even if they are not invoked in the current version of metaGOflow. We hope this encourages people to implement their own conditionals and/or subworkflows, either by exploiting the currently supported tools and utils or by developing new ones.