Eco-Flow / excon

A pipeline to measure gene family expansion and contraction
MIT License

A Nextflow pipeline to describe and compare genomes across species. It also performs gene family expansion and contraction analysis using CAFE.

It works with any set of species that each have a genome (FASTA) and annotation (GFF) file. Use a minimum of 5 species, ideally up to around 15.

You can also run GO annotation (with user-provided GO files, or with GO files semi-automatically downloaded from Ensembl biomart). This is then used to check what GO terms are associated with expanded or contracted gene sets (from CAFE).

The general pipeline logic is as follows:

Installation

Nextflow pipelines require a few prerequisites. The nf-core website has further documentation on how to install Nextflow.

Prerequisites

Install

To install the pipeline please use the following commands but replace VERSION with a release.

wget https://github.com/Eco-Flow/excon/archive/refs/tags/VERSION.tar.gz -O - | tar -xzvf -

or

curl -L https://github.com/Eco-Flow/excon/archive/refs/tags/VERSION.tar.gz --output - | tar -xzvf -

This will produce a directory in the current directory called excon-VERSION which contains the pipeline.
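
The VERSION substitution can also be scripted. The following is a minimal sketch, assuming the release tag is supplied by the user; the function name `excon_tarball_url` is hypothetical and not part of the pipeline, and the URL pattern is the one shown in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of excon): build the release tarball URL
# from a tag name, following the URL pattern used in the install commands.
excon_tarball_url() {
    local version="$1"
    if [ -z "$version" ]; then
        echo "usage: excon_tarball_url VERSION" >&2
        return 1
    fi
    printf 'https://github.com/Eco-Flow/excon/archive/refs/tags/%s.tar.gz\n' "$version"
}

# Example (commented out so that nothing is downloaded by accident).
# Set VERSION to a real release tag first; -z makes gzip decompression explicit.
# curl -L "$(excon_tarball_url "$VERSION")" --output - | tar -xzvf -
```

Keeping the tag in one variable avoids editing the URL by hand each time a new release is installed.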

Inputs

Required

The input CSV (passed with --input) can take 2 forms:

Please Note: The genome has to be chromosome level not contig level.

2 fields (Name,Refseq_ID):

Drosophila_yakuba,GCF_016746365.2
Drosophila_simulans,GCF_016746395.2
Drosophila_santomea,GCF_016746245.2

3 fields (Name,genome.fna,annotation.gff):

Drosophila_yakuba,data/Drosophila_yakuba/genome.fna.gz,data/Drosophila_yakuba/genomic.gff.gz
Drosophila_simulans,data/Drosophila_simulans/genome.fna.gz,data/Drosophila_simulans/genomic.gff.gz
Drosophila_santomea,data/Drosophila_santomea/genome.fna.gz,data/Drosophila_santomea/genomic.gff.gz
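
Before launching a run, it can help to check that every row has the expected shape. The following is a minimal sketch, not part of the pipeline: the function name `check_samplesheet` is hypothetical, and it only assumes what is described above, namely that each row has 2 or 3 comma-separated fields and that no field is empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of excon): report rows of a samplesheet
# that do not have 2 or 3 non-empty comma-separated fields.
check_samplesheet() {
    awk -F',' '
        $0 == "" { next }                  # ignore blank lines
        NF != 2 && NF != 3 {
            printf "line %d: expected 2 or 3 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "") {
                    printf "line %d: field %d is empty\n", NR, i
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }               # non-zero exit if any row failed
    ' "$1"
}

# Example:
# check_samplesheet data/input_small-s3.csv && echo "samplesheet looks OK"
```

The check does not verify that Refseq IDs exist or that the listed files are present; it only catches malformed rows early.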

Optional

Profiles

This pipeline is designed to run in various modes that can be supplied as a comma-separated list, e.g. -profile profile1,profile2.

Container Profiles

Please select one of the following profiles when running the pipeline.

Optional Profiles

Custom Configuration

If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.

Running the Pipeline

Please note: The -resume flag uses previously cached successful runs of the pipeline.

  1. Run the full test example data:
nextflow run main.nf -resume -profile docker,test_small

Settings in test_small:
input = "input_small-s3.csv"
predownloaded_fasta = "s3://excon/data/Insect_data/fasta/"
predownloaded_gofiles = "s3://excon/data/Insect_data/gofiles/"

For the fastest run, use:
nextflow run main.nf -resume -profile docker,test_bacteria

  2. Run on your own data (minimal run, CAFE only):
nextflow run main.nf -resume -profile docker --input data/input_small-s3.csv
  3. Run on your own data with GO enrichment analysis (using predownloaded fasta/GO files for GO assignment):
nextflow run main.nf -resume -profile docker --input data/input_small-s3.csv \
--predownloaded_fasta 's3://excon/data/Insect_data/fasta/*' --predownloaded_gofiles 's3://excon/data/Insect_data/gofiles/*'
  4. Run on your own data with GO enrichment analysis + retrieval of GO assignment species:

If you do not have GO files to run GO enrichment, you can semi-automatically download them from Ensembl BioMart as follows.

You first need to go to Ensembl Biomart to find the species IDs you want to use to assign GO terms to your species. Ideally you should choose one or more species that are closely related and have good GO annotations.

i) To check which species are present and what their species name codes are, install and load the biomaRt library in R (for metazoa):

library(biomaRt)
ensembl <- useEnsembl(biomart = "metazoa_mart", host="https://metazoa.ensembl.org")
datasets <- listDatasets(ensembl)
datasets

You will see something like:

                       dataset
1     aagca019059575v1_eg_gene
2       aagca914969975_eg_gene
3     aagca933228735v1_eg_gene
4           aalbimanus_eg_gene
5          aalbopictus_eg_gene
6            aalvpagwg_eg_gene

The dataset IDs are what you need to enter into the Nextflow script.

For mammals:

ensembl <- useEnsembl(biomart = "genes", host="https://ensembl.org")

Then you can run the excon script as follows:

nextflow run main.nf -resume -profile <apptainer/docker/singularity> --input data/input_small-s3.csv --ensembl_biomart "metazoa_mart" --ensembl_dataset "example.txt"

where example.txt is a file of dataset IDs from ensembl biomart (as shown above), separated by newline characters.

e.g.:

aagca019059575v1_eg_gene
aagca914969975_eg_gene
aagca933228735v1_eg_gene
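
One way to create such a file from the shell is sketched below. The function name `write_dataset_list` is hypothetical and not part of the pipeline; it simply writes the IDs it is given, one per line, and rejects IDs that are empty or contain whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of excon): write dataset IDs to a file,
# one per line, refusing empty IDs and IDs that contain whitespace.
write_dataset_list() {
    local outfile="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: write_dataset_list OUTFILE ID [ID...]" >&2
        return 1
    fi
    local id
    for id in "$@"; do
        case "$id" in
            ""|*[[:space:]]*)
                echo "invalid dataset ID: '$id'" >&2
                return 1
                ;;
        esac
    done
    printf '%s\n' "$@" > "$outfile"
}

# Example, using the dataset IDs listed above:
# write_dataset_list example.txt aagca019059575v1_eg_gene aagca914969975_eg_gene aagca933228735v1_eg_gene
```

The resulting file can then be passed to --ensembl_dataset as in the command above.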

Citation

This pipeline is not yet published. Please contact us if you wish to use it; we are happy to help you run it.

Contact Us

If you need any support do not hesitate to contact us at any of:

c.wyatt [at] ucl.ac.uk

ecoflow.ucl [at] gmail.com