MIT License
RepeatDefeaters - Utilities for unclassified consensus sequences

Overview

Motivation

For recently sequenced non-model organisms, repeat discovery tools often fail to classify a large portion of the repeats. These unclassified repeats can be duplicated host genes, or transposable elements (TEs) that are simply not present in the databases. TE analyses can be misleading if host genes are included among the repeats. RepeatDefeaters is a tool to further classify which repeats should be considered host genes and which should be considered TEs.

A critical task during repeat discovery is to accurately annotate TEs. This can be challenging for new organisms that have few references to compare with. RepeatDefeaters aims to provide an easy-to-follow guideline on how to tackle these consensus sequences that are difficult to classify automatically.

Key features

RepeatDefeaters provides:

  1. Utilities that work together to determine if your consensus sequence of interest is related to TE activity.

TE activity

A list of keywords that suggest TE activity is included in the file assets/pfam_te_domain_keywords.txt. Most of these keywords are known transposon protein domains, while others are plain terms such as "virus", "viral", and "transpos" (the last chosen to match both "transposon" and the verb "transpose"). From these keywords, a list of Pfam sequence IDs that might relate to TE activity has been generated by running the PFAM_TRANSPOSIBLE_ELEMENT_SEARCH process. The list can be found under assets/Pfam_R32.Proteins_wTE_Domains.seqid.
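As a rough illustration of how a keyword list like this can select sequences, the sketch below matches keywords as case-insensitive substrings against sequence descriptions. It is a minimal sketch under stated assumptions, not the implementation of the PFAM_TRANSPOSIBLE_ELEMENT_SEARCH process: the file names, sequence IDs, and descriptions are made up, and only the three plain keywords quoted above are used.

```shell
# Scratch directory so the example does not touch the workflow assets.
workdir=$(mktemp -d)

# Toy keyword list (the real list is assets/pfam_te_domain_keywords.txt).
printf '%s\n' 'virus' 'viral' 'transpos' > "$workdir/keywords.txt"

# Made-up "ID<TAB>description" records, for illustration only.
printf 'seqA\tTransposase DDE domain\n'      >  "$workdir/descriptions.tsv"
printf 'seqB\tKinase domain\n'               >> "$workdir/descriptions.tsv"
printf 'seqC\tViral coat protein\n'          >> "$workdir/descriptions.tsv"
printf 'seqD\tTransposon-encoded protein\n'  >> "$workdir/descriptions.tsv"

# -i: ignore case, -F: treat keywords as fixed strings, -f: read them from a file.
# cut keeps only the first column, i.e. the IDs of the matching records.
matches=$(grep -i -F -f "$workdir/keywords.txt" "$workdir/descriptions.tsv" | cut -f1)
echo "$matches"
```

Because "transpos" is matched as a substring, it selects both the "Transposase" and the "Transposon" records, which is the behaviour the keyword was chosen for.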

Setting the workflow parameter pfam_proteins_with_te_domain_list in the params block of a custom config (supplied with -c) skips the PFAM_TRANSPOSIBLE_ELEMENT_SEARCH process, which saves computation time:

params {
    pfam_proteins_with_te_domain_list = "$projectDir/assets/Pfam_R32.Proteins_wTE_Domains.seqid"
}

Usage

This workflow has been designed with portability and reproducibility in mind. It is implemented using the workflow manager Nextflow, which supports a wide range of execution platforms, from local execution to HPC and the cloud. Software package managers are used to bundle software dependencies, so that programs behave in the same manner across different execution platforms.

Usage:

nextflow run \
    -params-file params.yml \
    [-c <custom.config>] \
    [-profile <executor profile>] \
    GroundB/RepeatDefeaters

where:

Dependencies

Workflow inputs

Mandatory:

Optional:

Workflow package manager options:

Uppmax cluster options:

Tool specific customisation:

The tools makeblastdb, blastx, and pfam can have their parameters modified by altering their module-specific configuration in your custom.config file.

For example, to override the parameters for blastn in the TREP_BLASTN process (found in configs/modules.config), add the following block to your custom configuration file.

process {
    withName: 'TREP_BLASTN' { // Select the process name using the `withName` selector
        // Non-file tool parameters are supplied using ext.args (or ext.args2, ext.args3, ...;
        // check the relevant module to see which one to modify).
        ext.args = '-outfmt 6 -max_target_seqs 1 -evalue 1e-10'
    }
}

Workflow outputs

Output folders (subfolders of the folder provided by the results parameter):

Customisation for Uppmax

Uppmax is a set of high-performance computing (HPC) clusters available to the Swedish research community. A custom profile is available to ease use on an Uppmax cluster. With this profile, Nextflow submits jobs to the Slurm queue manager, uses the container technology Singularity to manage software dependencies, and uses the node-local storage $SNIC_TMP for intermediate computations.

nextflow run \
    -params-file params.yml \
    [-c <custom.config>] \
    -profile uppmax \
    GroundB/RepeatDefeaters

The command above optionally supplies custom configuration using the -c option, selects the uppmax configuration profile, and automatically downloads the workflow from https://github.com/GroundB/RepeatDefeaters.

To submit jobs to Slurm, a SNIC project allocation must be provided using the workflow parameter project. For example, set project: snic20xx-yy-zz in the params.yml file, or pass --project snic20xx-yy-zz on the command line.
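As a sketch of the params.yml option, the snippet below writes a parameter file that contains the project parameter. The value given for results and the scratch directory are placeholders chosen for illustration, the mandatory workflow inputs are left out because they are not listed in this section, and snic20xx-yy-zz must be replaced with a real allocation.

```shell
# Write an illustrative params.yml into a scratch directory.
params_dir=$(mktemp -d)
params_file="$params_dir/params.yml"

cat > "$params_file" <<'EOF'
# SNIC project allocation used when submitting jobs (placeholder value)
project: snic20xx-yy-zz
# Folder that receives the workflow outputs (placeholder value)
results: results
EOF

# Show the file that would be passed to Nextflow with -params-file.
cat "$params_file"
```

Once the mandatory inputs have been added, the file is passed to the workflow with -params-file, as in the command above.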

On Uppmax systems, Nextflow needs to be loaded using either the module system:

module load bioinfo-tools Nextflow
export NXF_HOME=/proj/<snic_compute_allocation>/nextflow

or by activating a conda environment:

conda activate /proj/<snic_compute_allocation>/conda/nextflow-env

created with the command:

wget https://raw.githubusercontent.com/GroundB/RepeatDefeaters/main/nextflow_conda-env.yml
conda env create \
    --prefix "/proj/<snic_compute_allocation>/conda/nextflow-env" \
    -f nextflow_conda-env.yml