andrewgull/MGERT - Githubissues

Mobile Genetic Elements Retrieving Tool - MGERT

MGERT is a computational pipeline for easy retrieving of MGE's coding sequences of a particular family from genome assemblies. MGERT utilizes several established bioinformatic tools combined into single pipeline which hides different technical quirks from an inexperienced user.

Table of contents:

Requirements
Short description
Flowchart of MGERT pipeline
Installation
Usage examples
Output description
List of arguments
Citation

Requirements

RepeatModeler 1.0.11
RepeatMasker open-4.0.7
bedtools v2.27.0
RPS-blast v2.7.1+
ORFfinder v0.4.1 - direct FTP-link:

ftp://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/ORFfinder/linux-i64/ORFfinder.gz
awk (is a standard feature of most Unix-like operating systems)
Python 3.5 or higher
Python libraries (should be easy to install via pip or anaconda):

Short description

The pipeline includes five steps:

de novo search for all MGEs in the genome assembly with RepeatModeler. This step results in a set of consensus sequences for every MGE class/family found (in fasta format). Note, that the classification of the consensuses is made by the RECON package (as the part of RepeatModeler pipeline), and you can retrieve only those MGEs that were classified.
collecting particular consensuses and search for their matches in the genome assembly (using RepeatMasker).
excising of found matches from the genome assembly according to coordinates in the annotation table from previous step.
search only for those sequences that contain Conserved Domains (CD), Open Reading Frame (ORF) and CD within this ORF (via successive runs of RPS-blast and ORFinder)
adding flanking regions to each sequence with the CD-encoding ORF.

You may run the pipeline from any of these steps, for instance, if you have your own MGEs library to search in the genome. During the steps 3 & 4 the pipeline creates several diagnostic plots and calculates descriptive statistics on the found sequences (number, mean length, sd, median length, 25th and 75th length percentiles).

Flowchart of MGERT pipeline

flowchart

Installation

Clone the repository and run installation script (requires administrator permissions):

git clone https://github.com/andrewgull/MGERT && cd MGERT
sudo ./install.sh

If you can't run installation script with sudo, you could place MGERT.py wherever you want, but you will be prompt to enter a path to test_dataset.tgz when running MGERT.py --test (see below for details).

Usage examples

Preparation steps

First, run configuration script with the following command:

MGERT.py --configure

This command will create a configuration file config.json with all the necessary paths (see "Requirements" section) and filenames MGERT uses. MGERT will try to find all the paths automatically. Unless it couldn't find them, it will prompt a user to enter a path or a filename.

After the configuration step you may run MGERT with the option --test to check out whether everything works as it supposed to on a toy data set (despite the size of the dataset, it can take a while).

To validate ORFs of found TEs fast, you should create a local version of Conserved Domain Database (CDD). To do this, download full Conserved Domain collection from the NCBI website: follow the link to the Conserved Domain Database, click on the Conserved Domains menu and choose FTP in the drop-down list. You will be redirected to the FTP site where you will find cdd.tar.gz archive. You can download it using either browser or command line utility like wget (in the latter case use the direct link ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz).
To figure out what filename you need to extract corresponding PSSM file from the archive, go to NCBI CDD, type in the name of domain of interest (e.g. "RT") click Search button and see a list of related domains (e.g. "RT_like", "RT_pepA17", "RT_nLTR_like" etc). Clicking on any entry, you will see a short description, a hierarchy of related domains and their PSSM codes ("cd00304" for "RT_like") - and this code is exactly what you need. So, to extract PSSM file for RT-domain, run the following command:

tar -zxvf cdd.tar.gz cd00304.smp

Create a simple CSV file (either comma or TAB delimited) that specifies PSSM-file - domain correspondence. Below you can see the correspondence file used for Penelope retroelements analysis:
```
cd00304.smp  RT
cd01648.smp  RT
pfam00078.smp    RT
pfam07727.smp    RT
pfam13966.smp    RT
cd01644.smp  RT
cd01709.smp  RT
cd10442.smp  EN
cd00719.smp  EN 
```
These files are used by MGERT to report only those ORFs that encodes for both domains (RT and EN) regardless of what actual PSSM file produced a hit.
Finally, make local CDD: put PSSM files (with *smp extension) to your working directory along with a CSV file specifying file - domain correspondence, and run MGERT with the following flag:

MGERT.py --make-cdd

This command will create a directory LocalCDD with all the necessary files inside it and the path to this CDD will be added to the config.json.

Now you can run the pipeline.

Full pipeline run starting from de novo MGE search using RepeatModeler

The shortest way is to run MGERT with all-default parameters (see "Parameters" section):

MGERT.py --mge-type Penelope --assembly genome.fna.gz

This command runs search and retrieving of Penelope retrotransposons' ORFs and flanking regions in the genome assembly.

Note, that for subsequent runs of MGERT, you don't need to move and gzip the genome assembly file again. The only thing you should care about is that the name of the directory where the assembly file is, were the same as the name of assembly itself (except extension) e.g. ./genome/genome.fna

Pipeline run from an arbitrary step

There are three possible steps to run the pipeline from (except the default one):

consensus step

Let's consider the situation when you already have a repeat library called, say, Penelope_consensi.fasta and you want to find instances of the repeats from the library in your assembly, and therefore there is no need to run de novo part of the pipeline. In this case simply type in the following command:

MGERT.py --assembly genome.fna --mge-type Penelope  --from-stage cons --lib Penelope_consensi.fasta

If consensus library is not specified, it will be automatically generated from the RepeatModeler output.

Furthermore, after this step a table with descriptive statistics and a histogram of repeats' lengths will be generated (shown below)

count   81848.0
mean    1380.6
std     1987.7
min     12.0
25%     152.0
50%     446.0
75%     1945.0
max     27686.0

where:

count - number of found repeats/hits;
mean - mean length of found repeat/hit;
std - standard deviation;
min - minimal length of found repeat/hit;
25% - the 25th percentile (the 1st quartile) of length of found repeats/hits;
50% - the 50th percentile (the 2nd quartile or median) of length of found repeats/hits;
75% - the 75th percentile (the 3rd quartile) of length of found repeats/hits;
max - maximum length of found repeats/hits.

histogram

coordinates step

In case when you have coordinates of repeats' matches (e.g. from previous step) - either .out or .bed file - you can run MGERT as follows:

MGERT.py --assembly genome.fna --mge-type MGE --from-stage coords --rm-table genome.fna.out

After this step a table with descriptive statistics and a histogram of repeats' lengths will be generated.

ORFs step

In case when you have TEs sequences (in FASTA format, normally it's the output of the previous step) and want to find ORFs with conserved domains, run the following command:

MGERT.py --assembly genome.fna --mge-type MGE --from-stage orfs --sequence MGE_sequences.fasta

After this step a table with descriptive statistics and a histogram of repeats' lengths will be generated as well.

flanks step

Is useful if you want to add flanking regions of certain length to ORFs.

MGERT.py --assembly genome.fna --mge-type MGE --from-stage flanks

Note, that at this step input is taken automatically from the config file what is OK if the previous step was done, otherwise MGERT will prompt you to enter the path to the fasta with ORFs. Also, input ORFs must be the ones that ORFfinder produces, cause their headers contain all the information MGERT requires to excise them from a genome.

Define the stage after which the pipeline should stop.

Say, you want to run RepeatModeler only in order to check what types of TEs it will find. In this case run the command:

MGERT.py --assembly genome.fna.gz --to-stage rmod

MGERT.py --check-types ./genome/consensi.fa.classified

To stop MGERT after RepeatMasker run, use:

MGERT.py --assembly genome.fna.gz -T Penelope --to-stage coordinates

Output description

MGERT outputs the following files:

rmod step:
- RepModOut directory with all the files produced by the RepeatModeler
- consensi.fa.classified - a fasta file with all the TEs' consensus sequences found by the RepeatModeler
- All_TE_consensi.fa - a fasta file with consensus sequences of specified TE (input for the following step)
- Unknown_consensi.fa.classified - fasta file with unclassified consensus sequences
- Unknown_classified.fa - a fasta file with unknown consensus sequences classified by the CENSOR software
consensus step:
- TE_genome.fna.RMout.txt - stdout of RepeatMasker
- TE_genome.fna.masked - genome.fna with masked TE hits
- TE_genome.fna.out - a table with TE hits coordinates (input for the following step)
- TE_genome.fna.ori.out - a table with TE hits coordinates without RepeatMasker merging (alternative input for the following step)
- TE_genome.fna.out.gff - an annotation of the TE hits in GFF format
- TE_genome.fna.tbl - a table with number of bases masked by RepeatMasker (TE abundance percentage)
- genome.fna.fai - genome index
- TE_genome.fna.out.bed - a table with TE hits coordinates in BED format (alternative input for the following step)
coordinates step:
- TE_excised_matchesX.fa - a fasta with TE hits retrieved from the genome. X stands for --merge value (input for the following step)
- TE_excised_matchesX.png - a histogram of length distribution of corresponding sequences
- TE_excised_matchesX.stats - a table with descriptive statistics
orfs step:
- TE_matches_with_hits_eX.fa - fasta file with those matches (from previous step) that contain CD hits. X stands for e-value cut off
- TE_matches_with_hits_eX.png - a histogram of length distribution of corresponding sequences
- TE_matches_with_hits_eX.stats - a table with descriptive statistics
- TE_cdsX_with_CD_eZ.fa - a fasta file with TE's ORFs of min length X and e-value cut off Z (input for the following step)
- TE_cdsX_with_CD_eZ.faa - a fasta file with TE's translated ORFs of min length X and e-value cut off Z
- TE_cdsX_with_CD_eZ.png - a histogram of length distribution of corresponding nucleotide sequences
- TE_cdsX_with_CD_eZ.stats - a table with descriptive statistics
flanks step (optional):
- TE_cdsX_with_CD_eZ_extended_LaRb.fa - a fasta file with TE's ORFs of min length X and e-value cut off Z, left flanking sequence of length a and right flanking sequence of length b

List of arguments

Required arguments

-a, --assembly - specify a genome assembly file (e.g. genome.fa.gz); this argument is mandatory on all stages since it indicates where the working directory is.

-T, --mge-type - specify the type of MGE to search (e.g. L1/BovB/RTE/CR1/LINE/Penelope/DIRS etc.)

Configuration arguments

-configure - run the configuration script

--make-cdd - make local CDD

Optional arguments

--test - run self-test after configuration on a toy data set

-cd, --cd-table - specify a path to a comma or tab delimited table of SMP files and their grouping (e.g. domains.csv). CSV extension is mandatory.

-f, --from-stage - specify the step from which the pipeline should start: 'consensus' - get consensus sequences; 'coords' - get sequences; 'orfs' - get ORFs; 'flanks' - add flanking sequences to CDS (default - rmod).

-S, --to-stage - specify the step (rmod, consensus, coords, orfs or flanks) at which the pipeline should finish (default - flanks)

-k, --check-types - print out all the types of MGE found in the RepeatModeler output (e.g. ./consensi.fa.classified).

-t, --threads - set number of threads (default - all available CPUs).

-C, --censor - provide a path to CENSOR classification results (HTML file or URL).

-o, --ori - if specified MGERT will use the *.ori file to fetch the coordinates instead of *_rm.out file.

-m, --merge - merge all hits within that number of bp into a single entry. Default 0 bp (i.e. no merge).

-e, --e-value - set expectation value (E) for RPS-BLAST. Default 0.01.

-c, --start-codon - ORF start codon to use. 0 = 'ATG' only; 1 = 'ATG' and alternative initiation codons; 2 = any sense codon; Default 0.

-l, --min-length - set minimum length of ORF to be reported, default 1000 bp.

-s, --strand - output ORFs on specified strand only (e.g. plus/minus/both). Default 'plus'.

-le, --left-end - set length of left (5') flanking region. Default 0 bp.

-re, -right-end - set length of right (3') flanking region. Default 0 bp (if both le and re are set to 0, flanks mode will be omitted).

-L, --rm-library - specify a path to a library for RepeatMasker (in FASTA format). Use with -f consensus only. If consensus library is not specified, it will be compiled from RepeatModeler output automatically.

-rm, --rm-table - specify repeat masker table to use (with *_rm.out or *.bed extension). Use with -f coords option only.

-sq, --sequence - provide a path to a file of sequences where to look for conservative domains. Use with -f orf option only.

-v, --version - show program's version number and exit.

Citation

Please, use the following citation, if you used MGERT in your research:

Guliaev AS, Semyenova SK. MGERT: a pipeline to retrieve coding sequences of mobile genetic elements from genome assemblies. Mob DNA. 2019;10(1):21. doi:10.1186/s13100-019-0163-6

Shortened URL to access the full-text article: https://rdcu.be/bCnQx