MGERT is a computational pipeline for easy retrieving of MGE's coding sequences of a particular family from genome assemblies. MGERT utilizes several established bioinformatic tools combined into single pipeline which hides different technical quirks from an inexperienced user.
Table of contents:
ORFfinder v0.4.1 - direct FTP-link:
ftp://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/ORFfinder/linux-i64/ORFfinder.gz
awk (is a standard feature of most Unix-like operating systems)
Python 3.5 or higher
Python libraries (should be easy to install via pip or anaconda):
The pipeline includes five steps:
You may run the pipeline from any of these steps, for instance, if you have your own MGEs library to search in the genome. During the steps 3 & 4 the pipeline creates several diagnostic plots and calculates descriptive statistics on the found sequences (number, mean length, sd, median length, 25th and 75th length percentiles).
Clone the repository and run installation script (requires administrator permissions):
git clone https://github.com/andrewgull/MGERT && cd MGERT
sudo ./install.sh
If you can't run installation script with sudo, you could place MGERT.py
wherever you want, but you will be prompt
to enter a path to test_dataset.tgz
when running MGERT.py --test
(see below for details).
MGERT.py --configure
This command will create a configuration file config.json with all the necessary paths (see "Requirements" section) and filenames MGERT uses. MGERT will try to find all the paths automatically. Unless it couldn't find them, it will prompt a user to enter a path or a filename.
After the configuration step you may run MGERT with the option --test
to check out whether everything works as
it supposed to on a toy data set (despite the size of the dataset, it can take a while).
wget
(in the latter case use the direct link ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz
).tar -zxvf cdd.tar.gz cd00304.smp
Create a simple CSV file (either comma or TAB delimited) that specifies PSSM-file - domain correspondence. Below you can see the correspondence file used for Penelope retroelements analysis:
cd00304.smp RT
cd01648.smp RT
pfam00078.smp RT
pfam07727.smp RT
pfam13966.smp RT
cd01644.smp RT
cd01709.smp RT
cd10442.smp EN
cd00719.smp EN
These files are used by MGERT to report only those ORFs that encodes for both domains (RT and EN) regardless of what actual PSSM file produced a hit.
Finally, make local CDD: put PSSM files (with *smp extension) to your working directory along with a CSV file specifying file - domain correspondence, and run MGERT with the following flag:
MGERT.py --make-cdd
This command will create a directory LocalCDD with all the necessary files inside it and the path to this CDD will be added to the config.json.
Now you can run the pipeline.
The shortest way is to run MGERT with all-default parameters (see "Parameters" section):
MGERT.py --mge-type Penelope --assembly genome.fna.gz
This command runs search and retrieving of Penelope retrotransposons' ORFs and flanking regions in the genome assembly.
Note, that for subsequent runs of MGERT, you don't need to move and gzip the genome assembly file again. The only thing you
should care about is that the name of the directory where the assembly file is, were the same as the name of assembly itself
(except extension) e.g. ./genome/genome.fna
There are three possible steps to run the pipeline from (except the default one):
Let's consider the situation when you already have a repeat library called, say, Penelope_consensi.fasta
and you want to find instances of the repeats from the library in your assembly, and therefore there is no need to run de novo part of the pipeline. In this case simply type in the following command:
MGERT.py --assembly genome.fna --mge-type Penelope --from-stage cons --lib Penelope_consensi.fasta
If consensus library is not specified, it will be automatically generated from the RepeatModeler output.
Furthermore, after this step a table with descriptive statistics and a histogram of repeats' lengths will be generated (shown below)
count 81848.0
mean 1380.6
std 1987.7
min 12.0
25% 152.0
50% 446.0
75% 1945.0
max 27686.0
where:
In case when you have coordinates of repeats' matches (e.g. from previous step) - either .out or .bed file - you can run MGERT as follows:
MGERT.py --assembly genome.fna --mge-type MGE --from-stage coords --rm-table genome.fna.out
After this step a table with descriptive statistics and a histogram of repeats' lengths will be generated.
In case when you have TEs sequences (in FASTA format, normally it's the output of the previous step) and want to find ORFs with conserved domains, run the following command:
MGERT.py --assembly genome.fna --mge-type MGE --from-stage orfs --sequence MGE_sequences.fasta
After this step a table with descriptive statistics and a histogram of repeats' lengths will be generated as well.
Is useful if you want to add flanking regions of certain length to ORFs.
MGERT.py --assembly genome.fna --mge-type MGE --from-stage flanks
Note, that at this step input is taken automatically from the config file what is OK if the previous step was done, otherwise MGERT will prompt you to enter the path to the fasta with ORFs. Also, input ORFs must be the ones that ORFfinder produces, cause their headers contain all the information MGERT requires to excise them from a genome.
Say, you want to run RepeatModeler only in order to check what types of TEs it will find. In this case run the command:
MGERT.py --assembly genome.fna.gz --to-stage rmod
MGERT.py --check-types ./genome/consensi.fa.classified
To stop MGERT after RepeatMasker run, use:
MGERT.py --assembly genome.fna.gz -T Penelope --to-stage coordinates
MGERT outputs the following files:
-a, --assembly
- specify a genome assembly file (e.g. genome.fa.gz); this argument is mandatory on all stages since it
indicates where the working directory is.
-T, --mge-type
- specify the type of MGE to search (e.g. L1/BovB/RTE/CR1/LINE/Penelope/DIRS etc.)
-configure
- run the configuration script
--make-cdd
- make local CDD
--test
- run self-test after configuration on a toy data set
-cd, --cd-table
- specify a path to a comma or tab delimited table of SMP files and their grouping (e.g. domains.csv). CSV extension is mandatory.
-f, --from-stage
- specify the step from which the pipeline should start: 'consensus' - get consensus sequences;
'coords' - get sequences; 'orfs' - get ORFs; 'flanks' - add flanking sequences to CDS (default - rmod).
-S, --to-stage
- specify the step (rmod, consensus, coords, orfs or flanks) at which the pipeline should finish (default - flanks)
-k, --check-types
- print out all the types of MGE found in the RepeatModeler output (e.g. ./consensi.fa.classified
).
-t, --threads
- set number of threads (default - all available CPUs).
-C, --censor
- provide a path to CENSOR classification results (HTML file or URL).
-o, --ori
- if specified MGERT will use the *.ori
file to fetch the coordinates instead of *_rm.out
file.
-m, --merge
- merge all hits within that number of bp into a single entry. Default 0 bp (i.e. no merge).
-e, --e-value
- set expectation value (E) for RPS-BLAST. Default 0.01.
-c, --start-codon
- ORF start codon to use. 0 = 'ATG' only; 1 = 'ATG' and alternative initiation codons; 2 = any sense codon; Default 0.
-l, --min-length
- set minimum length of ORF to be reported, default 1000 bp.
-s, --strand
- output ORFs on specified strand only (e.g. plus/minus/both). Default 'plus'.
-le, --left-end
- set length of left (5') flanking region. Default 0 bp.
-re, -right-end
- set length of right (3') flanking region. Default 0 bp (if both le and re are set to 0, flanks mode will be omitted).
-L, --rm-library
- specify a path to a library for RepeatMasker (in FASTA format). Use with -f consensus
only.
If consensus library is not specified, it will be compiled from RepeatModeler output automatically.
-rm, --rm-table
- specify repeat masker table to use (with *_rm.out
or *.bed
extension). Use with -f coords
option only.
-sq, --sequence
- provide a path to a file of sequences where to look for conservative domains. Use with -f orf
option only.
-v, --version
- show program's version number and exit.
Please, use the following citation, if you used MGERT in your research:
Guliaev AS, Semyenova SK. MGERT: a pipeline to retrieve coding sequences of mobile genetic elements from genome assemblies. Mob DNA. 2019;10(1):21. doi:10.1186/s13100-019-0163-6
Shortened URL to access the full-text article: https://rdcu.be/bCnQx